AFRL-SN-WP-TR-2005-1108 


DARPA  INTEGRATED  SENSING  AND 
PROCESSING  (ISP)  PROGRAM 
Approximation  Methods  for  Markov  Decision 
Problems  in  Sensor  Management 


Michael  K.  Schneider,  Ph.D. 

Angelia  Nedich,  Ph.D. 

Prof.  David  Castanon 
Bob  Washburn,  Ph.D. 

BAE  Systems 

Advanced  Information  Technology 
6  New  England  Executive  Park 
Burlington,  MA  01803 


JUNE  2006 

Final  Report  for  01  July  2002  -  30  June  2006 


Approved for public release; distribution is unlimited.


STINFO  COPY 

Appendices 2 through 4, resulting from Air Force contract number F33615-02-C-1197, have
been  submitted  for  publication  in  various  conference  proceedings.  If  published,  the  United 
States  has  for  itself  and  others  acting  on  its  behalf  an  unlimited,  paid-up,  nonexclusive, 
irrevocable,  worldwide  license.  Any  other  form  of  use  is  subject  to  copyright  restrictions. 


SENSORS  DIRECTORATE 
AIR  FORCE  RESEARCH  LABORATORY 
AIR  FORCE  MATERIEL  COMMAND 
WRIGHT-PATTERSON  AFB,  OH  45433-7320 


NOTICE 


Using  Government  drawings,  specifications,  or  other  data  included  in  this  document  for  any 
purpose  other  than  Government  procurement  does  not  in  any  way  obligate  the  U.S.  Government. 
The fact that the Government formulated or supplied the drawings, specifications, or other data
does not license the holder or any other person or corporation; or convey any rights or
permission to manufacture, use, or sell any patented invention that may relate to them.


This  report  was  cleared  for  public  release  by  the  Air  Force  Research  Laboratory  Wright  Site 
(AFRL/WS)  Public  Affairs  Office  (PAO)  and  is  releasable  to  the  National  Technical 
Information  Service  (NTIS).  It  will  be  available  to  the  general  public,  including  foreign 
nationals. 

PAO  Case  Number:  AFRL/WS-06-0157,  19  Jan  2006 


THIS  TECHNICAL  REPORT  IS  APPROVED  FOR  PUBLICATION. 


//Signature//
Alan D. Kerrick, Project Engineer
RF Systems and Analysis Branch
RF Technology Division

//Signature//
Keith W. Loree, Branch Chief
RF Systems and Analysis Branch
RF Technology Division

//Signature//
Timothy R. Poth, Major, USAF
Deputy Division Chief
RF Sensor Technology Division


This  report  is  published  in  the  interest  of  scientific  and  technical  information  exchange  and  its 
publication  does  not  constitute  the  Government’s  approval  or  disapproval  of  its  ideas  or  findings. 


REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB  No.  0704-0188 


The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,
including suggestions for reducing this burden, to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway,
Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if
it does not display a currently valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.


1.  REPORT  DATE  (DD-MM-YY)  2.  REPORT  TYPE  3.  DATES  COVERED  (From  -  To) 

June  2006  Final  07/01/2002-06/30/2006 


4. TITLE AND SUBTITLE

DARPA INTEGRATED SENSING AND PROCESSING (ISP) PROGRAM
Approximation Methods for Markov Decision Problems in Sensor Management

5a. CONTRACT NUMBER

F33615-02-C-1197

5b. GRANT NUMBER

5c. PROGRAM ELEMENT NUMBER

69199F

5d. PROJECT NUMBER

ARPS

5e. TASK NUMBER

NR

5f. WORK UNIT NUMBER

ou

6. AUTHOR(S)

Michael K. Schneider, Ph.D.
Angelia Nedich, Ph.D.
Prof. David Castanon
Bob Washburn, Ph.D.

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)

BAE Systems
Advanced Information Technology
6 New England Executive Park
Burlington, MA 01803

8. PERFORMING ORGANIZATION REPORT NUMBER

TR-1620

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)

Sensors Directorate
Air Force Research Laboratory
Air Force Materiel Command
Wright-Patterson AFB, OH 45433-7320

DARPA/DSO
3701 N. Fairfax Dr.
Arlington, VA 22203

10. SPONSORING/MONITORING AGENCY ACRONYM(S)

AFRL-SN-WP

11. SPONSORING/MONITORING AGENCY REPORT NUMBER(S)

AFRL-SN-WP-TR-2005-1108

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  is  unlimited. 


13.  SUPPLEMENTARY  NOTES 

Report  contains  color.  PAO  Case  Number:  AFRL/WS-06-0157,  19  Jan  2006. 

Appendices 2 through 4, resulting from Air Force contract number F33615-02-C-1197, have been submitted for publication
in  various  conference  proceedings.  If  published,  the  United  States  has  for  itself  and  others  acting  on  its  behalf  an  unlimited, 
paid-up,  nonexclusive,  irrevocable,  worldwide  license.  Any  other  form  of  use  is  subject  to  copyright  restrictions. 

14.  ABSTRACT 

This  work  addresses  problems  of  sensor  resource  management  (SRM)  in  which  one  or  more  sensors  obtain  measurements 
of  the  state  of  one  or  more  targets.  For  example,  an  airborne  radar  may  be  attempting  to  track  several  ground  targets,  which 
are  sometimes  stationary  (requiring  a  synthetic  aperture  radar  mode)  and  sometimes  moving  (requiring  a  ground  moving 
target  indication  radar  mode).  The  challenge  is  to  schedule  the  radar  modes  as  the  scenario  evolves.  Such  problems  can 
generally  be  formulated  as  partially  observable  Markov  decision  processes  (POMDPs),  which  can  express  essential 
characteristics  of  the  SRM  problem  such  as  uncertainty  and  dynamics.  This  work  emphasizes  a  farsighted  approach;  the 
highest  long-term  payoff  may  not  be  generated  by  the  action  providing  the  highest  immediate  payoff.  Accomplishments  of 
this effort include the establishment of a bound on optimal SRM performance, analysis of farsighted SRM strategies for
controlling  a  multimode  sensor,  and  the  derivation  of  a  novel  set  of  sufficient  conditions  for  optimality  in  Markov  decision 
processes. 


15.  SUBJECT  TERMS 

integrated  sensing  and  processing,  sensor  resource  management,  partially  observable  Markov  decision  processes,  farsighted 
sensor  management,  radar 


16. SECURITY CLASSIFICATION OF:

a. REPORT: Unclassified
b. ABSTRACT: Unclassified
c. THIS PAGE: Unclassified

17. LIMITATION OF ABSTRACT: SAR

18. NUMBER OF PAGES: 120

19a. NAME OF RESPONSIBLE PERSON (Monitor)

Alan D. Kerrick

19b. TELEPHONE NUMBER (Include Area Code)

(937) 255-6427, ext. 4343

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std.  Z39-18 




TABLE  OF  CONTENTS 


1  LIST  OF  FIGURES . iv 

2  LIST  OF  TABLES . v 

3  EXECUTIVE  SUMMARY . 1 

3.1 PROGRAM OVERVIEW . 1

3.2  ACCOMPLISHMENTS . 5 

4  FARSIGHTED  ALGORITHMS  FOR  CONTROLLING  SENSOR  MODES . 7 

4.1  INTRODUCTION . 7 

4.2  BASIC  PRINCIPLES  OF  ROLLOUT  ALGORITHMS . 7 

4.3  SRM  ALGORITHMS  FOR  MOVE-STOP  TRACKING . 9 

4.4  SIMULATION  RESULTS . 23 

4.5  CONCLUSIONS . 45 

5  COMPUTABLE  OPTIMAL  STRATEGIES . 46 

5.1 SUFFICIENT CONDITION . 47

5.2 APPLICATIONS TO SRM PROBLEMS . 52

5.3  BIRTH-DEATH  MDPS . 56 

5.4  BINARY  CLASSIFICATION  PROBLEM . 58 

5.5 SEARCH PROBLEM . 63

6  REFERENCES . 71 




1  LIST  OF  FIGURES 


Figure 1: BAE AIT’s ISP project focus on problems of SRM, in which sensors are dynamically
managed in closed loop to improve the quality of the data provided to the fusion system . 2

Figure 2: A rollout approach to evaluate near and far future benefits of an action taken at the
current state . 8

Figure  3:  SRM  Architecture . 11 

Figure  4:  Discrete-time  Markov  chain  modeling  target  motion . 24 


Figure  5:  The  plots  show  results  for  the  precision  maximizing  SRM  when  10  targets  are  stopped 
on  average.  The  top  plot  shows  the  true  target  motion . 28 

Figure  6:  The  plots  show  results  for  the  precision  maximizing  SRM  when  20  targets  are  stopped 
on  average.  The  top  plot  shows  the  true  target  motion . 29 


Figure  7:  The  plots  show  results  for  the  precision  maximizing  SRM  when  30  targets  are  stopped 
on  average.  The  top  plot  shows  the  true  target  motion . 30 

Figure  8:  The  plots  show  results  for  the  precision  maximizing  SRM  when  40  targets  are  stopped 
on  average.  The  top  plot  shows  the  true  target  motion . 31 


Figure  9:  The  plots  show  results  for  the  error  minimizing  SRM  when  10  targets  are  stopped  on 
average.  The  top  plot  shows  the  true  target  motion . 33 

Figure  10:  The  plots  show  results  for  the  error  minimizing  SRM  when  20  targets  are  stopped  on 
average.  The  top  plot  shows  the  true  target  motion . 34 


Figure  11:  The  plots  show  results  for  the  error  minimizing  SRM  when  30  targets  are  stopped  on 
average.  The  top  plot  shows  the  true  target  motion . 35 

Figure  12:  The  plots  show  results  for  the  error  minimizing  SRM  when  40  targets  are  stopped  on 
average.  The  top  plot  shows  the  true  target  motion . 36 


Figure 13: The plots show results for the myopic entropy-based SRM when 10 targets are
stopped on average. The top plot shows the true target motion . 38

Figure 14: The plots show results for the myopic entropy-based SRM when 20 targets are
stopped on average. The top plot shows the true target motion . 39

Figure 15: The plots show results for the myopic entropy-based SRM when 30 targets are
stopped on average. The top plot shows the true target motion . 40

Figure 16: The plots show results for the myopic entropy-based SRM when 40 targets are
stopped on average. The top plot shows the true target motion . 41

Figure  17:  The  results  for  the  time  averaged  mean-square  error  obtained  for  the  four  target- 
motion  scenarios  and  for  the  three  SRM  algorithms . 43 

Figure  18:  The  results  for  the  average  fraction  of  time  the  target  error  goals  are  met  obtained  for 
the  four  target-motion  scenarios  and  for  the  three  SRM  algorithms . 44 


Figure  19:  In  example  1,  the  state  of  the  target  remains  unchanged  if  it  is  not  observed,  but  the 
state  may  change,  as  indicated  in  the  illustration,  if  the  target  is  observed . 58 




2  LIST  OF  TABLES 


Table 1: Tracker model parameter values used in our simulations . 26

Table  2:  Sensor  model  parameter  values  used  in  our  simulations . 26 

Table  3:  Dynamic  programming  parameter  values  used  in  our  simulations . 26 

Table  4:  The  relation  for  the  transition  probabilities  Psm  and  Pms  as  the  average  number  of  targets 
stopped  (in  the  steady  state)  takes  values  10,  20,  30,  and  40 . 27 




3  EXECUTIVE  SUMMARY 


3.1  PROGRAM  OVERVIEW 


The  BAE  SYSTEMS  Advanced  Information  Technology  (BAE  AIT)  Integrated  Sensing  and 
Processing  (ISP)  project  was  analyzing  and  developing  algorithms  to  solve  the  problem  of 
closed-loop  sensor  resource  management  (SRM),  as  illustrated  in  Figure  1.  These  problems 
consist  of  managing  one  or  more  sensors  to  obtain  measurements  of  the  state  of  one  or  more 
targets.  The  measurements  are  then  fused  to  estimate  the  states  of  the  targets.  The  process  of 
fusing  the  measurements  reduces  the  uncertainty  in  the  measurements  and  combines 
measurements to provide complementary information. The measurement collection can be
managed  at  several  levels,  but  it  is  natural  to  group  these  into  three  categories:  collection 
management,  sensor  management,  and  dwell  management.  A  collection  manager  controls  from 
what  locations  measurements  are  made  by  placing  the  sensors  in  those  locations.  A  sensor 
manager  controls  what  is  measured  by  determining  which  sensors  to  use  and  by  adjusting  coarse 
control  parameters  such  as  pointing  angle  and  operating  mode  on  an  individual  sensor.  A  dwell 
manager  controls  how  a  measurement  is  made  by  adjusting  sensor  characteristics  such  as 
transmitted  power  on  an  active  sensor.  The  focus  of  the  BAE  AIT  project  was  on  sensor 
management. 

The potential applications of sensor management are numerous and include computer network
security, environmental monitoring, and air-to-ground intelligence, surveillance, and
reconnaissance (ISR).

•  In  computer  network  security  applications  [1],  the  objective  is  to  monitor  the  state  of  a 
machine  to  determine  if  it  is  being  attacked  or  has  been  accessed  by  an  intruder.  Multiple 
processes  are  operating  on  the  computer  simultaneously,  and  software  sensors  can 
monitor  the  activity  of  each  of  these  processes  as  well  as  the  aggregate  state  of  the 
machine.  The  potential  exists  to  apply  sensor  management  algorithms  to  dynamically 
control  where  and  how  measurements  are  made  in  the  system  to  optimize  the  detection 
process. 

•  In  environmental  monitoring  applications,  one  is  sensing  the  environment  to  determine 
the  distribution  of  one  or  more  environmental  resources  or  pollutants.  Potential  sensor 
locations  may  be  predetermined,  and  then  the  sensor  management  problem  is  one  of 
determining  which  sensor  sites  to  use  or  how  many  to  use.  An  example  would  be 
selecting  among  candidate  well  sites,  determining  the  ones  to  use  for  characterizing  the 
spread  of  a  pollutant  underground. 

•  In  air-to-ground  ISR  applications,  the  objective  is  to  find,  track  the  position  of,  and 
classify  ground  targets  with  airborne  sensors.  The  sensors  can  be  steered  to  look  at 
different areas on the ground. For a fixed sensor platform route, the SRM problem is
then one of determining where to collect measurements on the ground.

Figure 1: BAE AIT's ISP project focus on problems of SRM, in which sensors are
dynamically managed in closed loop to improve the quality of the data provided to the
fusion system

Our  objective  in  analyzing  the  SRM  problem  is  to  develop  insights  applicable  to  a  broad  class  of 
applications  by  developing  formulations  in  a  general  mathematical  framework.  In  particular, 
SRM problems can generally be formulated as partially observable Markov decision processes
(POMDPs).  This  is  a  suitable  framework  because  it  can  express  essential  characteristics  of  the 
SRM  problem,  such  as  uncertainty  and  dynamics.  Uncertainty  is  present  in  the  SRM  problems 
because  the  data  being  collected  is  noisy.  Thus,  the  estimates  of  targets’  true  states  are  always 
uncertain,  and,  consequently,  the  predictions  of  measurements  that  would  result  from  future 
sensor  tasks  are  uncertain.  Dynamics  are  present  in  SRM  problems  because  one's  estimates  of 
true  target  states  are  constantly  changing.  This  occurs  for  two  reasons.  One  is  that  the  true 
states  of  the  targets  are  changing.  The  other  is  that  new  data  is  being  collected  over  time  and 
fused  to  form  new  estimates  of  target  states.  The  uncertainty  and  dynamics  can  be  modeled  in 
the POMDP framework as a function of the sensor control policy, which dictates what sensor control
actions are taken in response to different estimates of target states. The POMDP framework is
rich  enough  to  model  uncertainties,  dynamics  and  control  options  in  a  broad  class  of  SRM 
problems. 
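
To make the role of uncertainty and dynamics concrete, the following minimal Python sketch propagates and updates the belief (the probability distribution over a single target's state), which is the quantity a POMDP sensor policy acts on. The two-state moving/stopped model, the transition probabilities, the detection probabilities, and the mode labels are illustrative assumptions, not values taken from this report.

```python
def predict(belief, transition):
    """Propagate a belief one step through the target's Markov chain."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def update(belief, likelihood):
    """Bayes update of a belief given per-state measurement likelihoods."""
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical two-state target: index 0 = moving, index 1 = stopped.
TRANSITION = [[0.9, 0.1],
              [0.2, 0.8]]

# Hypothetical probability of a detection in each state, per sensor mode.
DETECTION_PROB = {"GMTI": [0.9, 0.1], "SAR": [0.1, 0.9]}

belief = [0.5, 0.5]
predicted = predict(belief, TRANSITION)                # dynamics: the state may change
posterior = update(predicted, DETECTION_PROB["GMTI"])  # new data: a GMTI detection
print(predicted, posterior)
```

Both sources of change named in the text appear here: `predict` alters the estimate because the true state may have changed, and `update` alters it because new data has been fused.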

Given a POMDP formulation of an SRM problem, including the objective criterion as a function of
the  data  collected,  dynamic  programming  can  be  used  to  compute  an  optimal  sensor  policy.  The 
dynamic  programming  algorithm  essentially  enumerates  all  possible  control  actions  and 
outcomes  starting  from  a  given  collection  of  historical  data  and  selects  the  sequence  of  controls 
that  optimizes  the  expected  outcome  as  measured  by  the  objective  criterion.  Dynamic 
programming  is  generally  difficult  to  implement  when  the  size  of  the  problem  is  large.  In  SRM, 
the  size  of  the  problem  is  dictated  by  the  set  of  control  options,  the  set  of  measurement  outcomes 
resulting  from  a  sensor  control,  and  the  set  of  possible  probability  distributions  of  target  states. 
The  sets  are  generally  large  and,  when  they  are  discrete,  scale  exponentially  with  the  number  of 
targets. As a result, computing the optimal sensor control policy with dynamic programming is
difficult and beyond the resources of current and foreseeable computers.

Our problem was to analyze a class of SRM problems and develop algorithms for computing
control  policies  which  are  computationally  efficient  and  near  optimal.  Although  POMDPs 
arising  in  SRM  problems  are  large  enough  that  a  standard  dynamic  programming  algorithm  will 
be  too  computationally  intensive,  the  SRM  problems  have  considerable  structure  that  can  be 
exploited  to  compute  good  sensor  control  policies.  In  particular,  the  targets  in  SRM  problems 
have states $x_i$, $i = 1, \ldots, n$, that are generally independent. Thus, one would expect that the
computation of a sensor policy could be decomposed into a set of computations performed
independently on each individual target. However, the sensor control at a particular time is not a
function of the target states but of the collection of information on each target, $I_j(t) = \{ y_j(t_v) : t_v < t \}$
for $j = 1, \ldots, n$, where $y_j(t_v)$ is a measurement made of target $j$ at time $t_v$. Moreover, the information
collections $I_j(t)$ at a particular time are not independent even if the individual measurements $y_j$ of
the different targets are independent. This is a consequence of the measurements in each of the
collections $I_j(t)$ being selected by the sensor control policy. The dependence among the
information  collections  complicates  the  structure  of  the  problem.  Our  challenge  is  to  exploit  the 
structure  that  is  present  to  find  computable,  near-optimal  sensor  control  policies. 
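
The appeal of a per-target decomposition can be seen from a simple count. The sketch below compares the number of joint discrete target states with the total number of states across independent single-target subproblems. It assumes, purely for illustration, that every target has the same number of discrete states; the function names are hypothetical, and the count ignores the coupling through the sensor control policy described above, so it indicates the potential savings rather than a guarantee.

```python
def joint_size(states_per_target, num_targets):
    """Number of joint states when all targets are modeled together."""
    return states_per_target ** num_targets


def decomposed_size(states_per_target, num_targets):
    """Total number of states across independent single-target subproblems."""
    return states_per_target * num_targets


# Two states per target (for example, moving and stopped).
for n in (1, 5, 10, 20):
    print(n, joint_size(2, n), decomposed_size(2, n))
```

The joint count grows exponentially in the number of targets while the decomposed count grows linearly, which is why exploiting the independence of target states is attractive whenever it can be justified.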

Many  past  approaches  to  developing  computable  sensor  control  policies  have  focused  on  myopic 
approaches.  Myopic  approaches  select  sensor  control  actions  based  on  their  immediate  expected 
benefits  evaluated  over  the  duration  of  a  single  sensor  action.  These  policies  are  generally 
computationally  efficient.  However,  they  do  not  account  for  some  key  aspects  of  the  problem. 
In  particular,  they  do  not  account  for  dynamics  in  the  problem  other  than  the  immediate  state 
changes resulting from the next control action. Thus, myopic policies may take an action at a
point in time that has immediate benefits but that would be better postponed to a more opportune
time in the future given the dynamics of the problem. Moreover, myopic policies will generally
only  be  effective  if  the  reward  resulting  from  an  action  is  immediate.  If  the  rewards  are  not 
realized  until  the  future,  then  myopic  policies  will  not  work. 

In  contrast,  our  approaches  to  developing  sensor  control  policies  are  farsighted.  In  other  words, 
the  policy  accounts  for  dynamics  over  a  time  horizon  that  is  longer  than  it  takes  to  complete  the 
next  control  action.  Two  reasons  why  this  is  beneficial  are  that  it  will  appropriately  value  the 
placement  of  a  control  in  time  and  it  will  appropriately  value  actions  that  do  not  have  immediate 
benefits.  An  example  of  the  first  case  is  a  scenario  in  which  the  target  states  are  changing  in 
time  so  that  the  value  of  different  control  actions  is  changing  in  time.  A  farsighted  policy  will 
account for this and postpone certain sensor actions to opportune times in the future even if
there  are  some  immediate  benefits.  Such  a  policy  is  especially  beneficial  if  alternative  control 
actions  require  different  amounts  of  time  to  complete.  In  this  case,  the  evolution  of  the  target 
states  over  the  period  required  for  the  control  action  to  complete  will  be  significantly  different 
for  the  alternate  controls.  Selecting  the  best  sequence  of  controls  requires  accounting  for  these 
dynamics.  An  example  of  the  second  benefit  often  occurs  when  rewards  reflect  threshold 
objectives.  For  example,  one  may  want  to  achieve  a  particular  level  of  accuracy  in  the  state 
estimate.  The  reward  function  may  then  be  modeled  as  taking  the  value  0  if  the  objective  is  not 
met  and  1  if  it  is  met.  As  a  result,  there  may  be  no  immediate  benefits  to  taking  a  particular 
sensor  action.  A  farsighted  policy,  however,  would  value  a  particular  sensor  action  accounting 
for  the  potential  for  achieving  one's  objective  in  the  future. 
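
The threshold-reward case can be illustrated with a toy model. In the sketch below, a hypothetical "refine" action makes progress toward an accuracy goal and pays a reward of 1 only at the step on which the goal is reached, while a hypothetical "quick_look" action pays a small immediate reward and makes no progress. The model, the action names, the horizon, and the reward values are all assumptions chosen for illustration; the farsighted value is computed with a plain finite-horizon dynamic programming recursion, not with the algorithms developed in this report.

```python
GOAL = 3      # progress level at which the accuracy goal is met (assumed)
HORIZON = 5   # number of sensor actions available (assumed)
ACTIONS = ("quick_look", "refine")


def step(progress, action):
    """Return (next_progress, reward) for the deterministic toy model."""
    if action == "quick_look":
        return progress, 0.2          # small immediate payoff, no progress
    next_progress = min(progress + 1, GOAL)
    # Threshold reward: paid once, on the step that first reaches the goal.
    reward = 1.0 if (next_progress == GOAL and progress < GOAL) else 0.0
    return next_progress, reward


def farsighted_value(progress, steps_left):
    """Best total reward over the remaining horizon (exhaustive recursion)."""
    if steps_left == 0:
        return 0.0
    return max(
        reward + farsighted_value(nxt, steps_left - 1)
        for nxt, reward in (step(progress, a) for a in ACTIONS)
    )


def myopic_value(progress, steps_left):
    """Total reward when each action maximizes only the immediate reward."""
    total = 0.0
    for _ in range(steps_left):
        action = max(ACTIONS, key=lambda a: step(progress, a)[1])
        progress, reward = step(progress, action)
        total += reward
    return total


print("myopic:", myopic_value(0, HORIZON))
print("farsighted:", farsighted_value(0, HORIZON))
```

In this toy model the myopic rule never starts refining, because the first refinement step has no immediate payoff, whereas the farsighted recursion accepts three unrewarded steps in order to collect the threshold reward.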

Our  approach  to  developing  farsighted  policies  that  are  computable  was  to  decompose  the 
problem  into  subproblems.  As  mentioned  previously,  the  individual  target  states  are  often 
modeled  as  independent.  Thus,  a  natural  approach  to  decomposing  the  computation  of  the  policy 
is across targets. The problem of computing the sensor control policy is thus split into n distinct
subproblems, where n is the number of targets. Each of these subproblems is formulated in terms
of the information particular to an individual target. This approach to decomposing the problem
has the potential to increase the efficiency of computing the sensor control policy by solving the
n simpler subproblems rather than performing the computation on the single aggregate problem
that includes the collection of information across all of the targets. This structure will not always
be  optimal.  However,  algorithms  having  this  structure  may  be  optimal  in  some  situations  and 
near-optimal  in  others.  Moreover,  some  algorithms  having  the  structure  may  not  yield  a  feasible 
sequence  of  sensor  controls  but  may  be  used  to  generate  a  bound  on  the  optimal  value  of  a  sensor 
management  problem.  We  have  specifically  investigated  approaches  to  computing  the  following 
three quantities: lower bounds on the optimal sensor management performance; good,
suboptimal  strategies  for  sensor  management;  and  optimal  sensor  management  strategies. 

1.  Lower  bounds  on  optimal  sensor  management  performance.  We  were  interested  in 
deriving  bounds  for  the  problem  of  dynamic  adaptive  scheduling  of  multi-mode  sensor 
resources  when  classifying  multiple  unknown  objects.  Sensor  schedules  are  adapted  based  on 
the  observed  data,  and  the  objective  is  specified  in  terms  of  a  terminal  cost.  We  were 
interested  in  comparing  the  performance  of  farsighted  and  myopic  strategies  on  such 
problems. Lower bounds on performance allow us to determine how close to optimal
candidate  strategies  may  be.  We  were  particularly  interested  in  deriving  bounds  based  on 
relaxations  for  which  the  optimal  policy  can  be  expressed  as  a  mixture  of  farsighted  sensor 
policies,  each  of  which  is  local  in  the  sense  that  it  only  depends  on  the  information 
concerning  an  individual  target. 

2.  Good,  suboptimal  strategies  for  sensor  management.  We  were  also  investigating 
techniques  for  developing  good,  computable,  suboptimal  strategies  for  sensor  management 
problems with no known computable, optimal strategies. A particular focus was on
developing farsighted strategies and determining the benefits such strategies may have over
myopic strategies. We were specifically interested in farsighted strategies whose
computation involves a decomposition into single sensor problems.

3.  Optimal  sensor  management  strategies.  Another  topic  of  study  was  the  development  of 
techniques  for  deriving  optimal  sensor  management  strategies.  As  mentioned  previously, 
computing  optimal  sensor  management  strategies  in  general  is  often  intractable.  For  special 
cases,  there  exist  techniques  for  deriving  an  optimal  solution  that  can  be  computed 
efficiently,  often  as  a  result  of  a  decomposition  of  the  problem  into  a  set  of  single  target 
problems.  Being  able  to  compute  optimal  solutions  is  useful  for  many  reasons  including  the 
following  two.  First,  the  quality  of  performance  bounds  or  suboptimal  strategies  can  be 
evaluated  by  comparing  the  performance  predicted  by  the  bounds  or  resulting  from  the 
suboptimal  strategies  on  a  special  sensor  management  problem  for  which  one  can  compute 
the  optimal  solution.  Second,  the  optimal  solution  to  a  special  sensor  management  problem 
can  be  incorporated  as  a  component  of  a  suboptimal  solution  to  a  more  general  sensor 
management  problem.  Various  techniques  exist  currently  for  deriving  optimal  solutions  to 
sensor  management  problems.  For  example,  a  particular  class  of  problems  for  which 
techniques  exist  is  the  class  of  multi-armed  bandit  problems.  However,  existing  techniques 
do  not  apply  to  the  sensor  management  problems  of  interest  in  this  program.  We  were 
investigating  novel  techniques  for  computing  optimal  sensor  management  strategies  that 
could be used to verify the quality of bounds or of suboptimal sensor management strategies
developed for this program.

We analyzed these approaches in the context of ISR scenarios with air-to-ground sensors.
Sensors  in  such  a  scenario  are  being  used  to  detect,  track,  and  classify  ground  targets.  The 
sensors  include  agile  airborne  radars,  which  have  a  constrained  field  of  view  but  can  be 
instantaneously  steered  to  observe  a  specific  area  on  the  ground  within  a  wide  field  of  regard.  In 
addition,  the  sensors  may  be  able  to  operate  in  different  modes.  Each  mode  may  have  distinct 
characteristics  and  be  suitable  for  observing  specific  activities. 

3.2  ACCOMPLISHMENTS 


Our  accomplishments  during  the  program  have  included  those  in  the  following  list.  Details  on 
each  of  these  are  provided  in  the  subsequent  sections  and  appendices. 

•  Developed  a  bound  on  optimal  sensor  management  performance.  The  bound  is 
derived  by  relaxing  the  problem.  In  particular,  almost  sure  constraints  in  the  problem  are 
relaxed  to  expected-use  constraints.  For  the  relaxed  problem,  we  have  proved  that  the 
optimal  sensor  control  policy  can  be  expressed  as  a  mixture  of  local  sensor  policies.  In 
this  setting,  a  local  sensor  policy  is  one  which  is  only  a  function  of  a  single  target's  state. 
The value of the optimal control can thus be computed by solving subproblems
associated  with  each  target.  We  have  verified  that  the  bound  is  tight  for  a  special  case  for 
which  we  derived  the  optimal  strategy.  The  bound  has  been  used  to  evaluate  the 
performance  of  farsighted  and  myopic  strategies  to  manage  a  sensor  to  classify  targets. 

•  Developed  and  analyzed  farsighted  sensor  management  strategies  for  controlling  a 
multimode  sensor.  In  particular,  we  developed  two  types  of  farsighted  algorithms  for 
managing  a  sensor.  The  algorithms  control  where  the  sensor  points  as  well  as  the  mode 
to  use.  The  sensor  modes  are  assumed  to  require  different  amounts  of  time  and  be 
suitable  for  observing  targets  in  particular  states.  We  also  identified  a  myopic  strategy 
for  controlling  this  type  of  sensor.  All  three  techniques  were  evaluated  in  a  simulation  of 
an  ISR  scenario  in  which  a  multimode  radar  is  used  to  track  targets  as  they  start  and  stop. 
The  results  indicate  that  the  farsighted  algorithms  perform  better  than  the  myopic 
approach. 

•  Derived  a  novel  set  of  sufficient  conditions  for  optimality  in  Markov  decision 
problems.  The  conditions  apply  to  finite-horizon  Markov  decision  problems  with  a 
terminal reward. We applied the conditions to verify the optimality of strategies we
developed  for  two  different  sensor  management  problems.  The  first  is  a  class  of 
symmetric  binary  classification  problems.  Specifically,  a  single  sensor  is  being  managed 
to  collect  discriminatory  data  to  classify  targets  being  one  of  two  types.  The  second  is  a 
class of search problems. The first type of problem is also one for which the performance
bounds apply. Thus, the conditions derived for optimality enabled us to investigate the
quality of performance bounds by comparing the bound to the optimal performance for a
special  case. 

•  Presented  an  overview  paper  on  sensor  management  at  the  IEEE  Automatic 
Control  Conference.  The  paper  provides  an  overview  of  the  problem  of  managing 
sensor  resources  in  a  closed-loop  sensor  fusion  system.  We  formulated  the  problem  in  a 
stochastic  dynamic  programming  framework.  In  so  doing,  we  exposed  structure  in  the 
problem  resulting  from  target  dynamics  being  independent  and  discussed  how  this  can  be 
exploited in solution strategies. We illustrated situations in which we believe such sensor
management techniques are especially beneficial with two examples. The focus of both
examples  was  on  air-to-ground  tracking. 

•  Submitted  a  paper  on  farsighted  sensor  management  strategies  for  move/stop 
tracking  to  the  International  Conference  on  Information  Fusion.  We  considered  the 
sensor  management  problem  arising  in  using  a  multi-mode  sensor  to  track  moving  and 
stopped  targets.  The  sensor  management  problem  is  to  determine  what  measurements  to 
take  in  time  so  as  to  optimize  the  utility  of  the  collected  data.  Finding  the  best  sequence 
of  measurements  is  a  hard  combinatorial  problem  due  to  many  factors,  including  the 
large  number  of  possible  sensor  actions  and  the  complexity  of  the  dynamics.  The 
complexity  of  the  dynamics  is  due  in  part  to  the  sensor  dwell-time  depending  on  the 
sensor  mode,  targets  randomly  starting  and  stopping,  and  the  uncertainty  in  the  sensor 
detection  process.  For  such  a  sensor  management  problem,  we  proposed  a  novel, 
computationally  efficient,  farsighted  algorithm  based  on  an  approximate  dynamic 
programming  methodology.  The  algorithm’s  complexity  is  polynomial  in  the  number  of 
targets.  We  evaluated  this  algorithm  against  a  myopic  algorithm  optimizing  an 
information-theoretic  scoring  criterion.  Our  simulation  results  indicate  that  the  farsighted 
algorithm  performs  better  with  respect  to  the  average  time  the  track  error  is  below  a 
specified  goal  value. 

•  Submitted  a  paper  on  bounding  optimal  sensor  management  performance  to  the 
IEEE  Conference  on  Decision  and  Control.  We  considered  a  network  of  sensors,  each 
of  which  has  limited  sensing  resources,  which  is  tasked  with  collecting  noisy 
classification information on a group of unknown objects. The amount of resources
required for a given sensor to measure an object depends on the specific sensor-object
geometry.  Sensors  exchange  collected  information  to  estimate  object  identities  and 
coordinate  which  measurements  to  collect  next.  This  paper  describes  a  computable  lower 
bound  on  the  classification  error  that  can  be  achieved  by  a  causal  adaptive  sensing 
schedule.  This  bound  is  based  on  a  formulation  of  the  adaptive  sensing  problem  as  a 
partially  observed  stochastic  control  problem.  Expanding  the  admissible  control  space  of 
this  problem  leads  to  a  relaxed  problem  with  simpler  decision  structure  for  which  the 
bounds  can  be  computed.  The  bound  computations  are  illustrated  for  several  examples 
involving  100  unknown  objects,  and  compared  with  the  Monte  Carlo  performance  of 
specific  adaptive  sensor  scheduling  algorithms.  Comparisons  with  optimal  scheduling 
algorithms  for  special  cases  illustrate  the  tightness  of  the  bounds. 

What  follows  in  the  next  few  sections  and  appendices  are  details  on  the  accomplishments 
summarized  in  Section  3.2.  Each  of  the  sections  provides  a  self-contained  presentation  on  one  of 
the  accomplishments. 

4 FARSIGHTED ALGORITHMS FOR CONTROLLING SENSOR MODES


4.1  INTRODUCTION 

Here,  we  focus  on  investigating  the  advantages  of  applying  farsighted  policies  to  sensor  resource 
management (SRM) problems. In particular, we are interested in determining if there are SRM
problems  where  using  farsighted  policies  is  beneficial.  We  want  to  explore  and  characterize 
such  problems,  as  well  as  find  out  what  kind  of  benefits  we  may  expect. 

An  SRM  problem  is  typically  characterized  by  a  set  of  targets  of  interest,  a  specific  mission 
objective,  and  a  set  of  available  sensors.  The  goal  of  the  sensor  manager  is  to  allocate  the  sensor 
resources  among  the  targets  in  time  to  support  the  mission  success.  The  sensor  resources  have  to 
be  allocated  in  the  presence  of  uncertainty  associated  with  obtaining  a  measurement  (e.g.,  those 
uncertainties  resulting  from  the  sensor  detection  process). 

Our  conjecture  is  that  a  farsighted  approach  is  better  than  a  myopic  approach  for  the  SRM 
problem  where  the  sensor  dwell-time  required  to  complete  a  sensor  task  is  different  for  different 
tasks. A myopic approach selects a sensor action based on the current information only. Thus,
this approach is oblivious to the time required to complete the selected action and to the future
benefits resulting from the action. On the other hand, a farsighted approach accounts for the
action’s benefits as well as the time it takes to receive them. We believe that a farsighted
approach  having  the  ability  to  anticipate  future  consequences  resulting  from  the  actions  taken 
now  will  yield  sensor  schedules  that  manage  the  resources  better  and  support  the  mission  success 
better  than  a  myopic  approach. 

To support our conjecture, we analyze a sensor management problem arising in
move/stop tracking with multimode radar. In particular, we are interested in tracking ground
targets through their motion state transitions with a radar sensor having two modes: an MTI
mode for detecting moving targets only and an FTI mode for detecting stopped targets only.
These modes have dwell durations that differ by about two orders of magnitude: an FTI dwell is about
100 times longer than an MTI dwell, which adds to the problem complexity. The goal is to
manage sensor resources to provide continuous tracking of the targets.

For  this  problem,  we  compare  a  myopic  entropy-based  SRM  algorithm  to  farsighted  SRM 
algorithms.  The  farsighted  SRM  algorithms  generate  sensor  actions  by  evaluating  an  objective 
function  parameterized  by  predictions  of  target  state.  The  objective  function  is  constructed  to 
value  the  future  benefits  of  a  measurement  obtained  now.  The  algorithms  also  estimate  the 
probability  of  a  target  sitting  or  moving  from  past  reports  so  as  to  evaluate  the  expected  benefits 
of  MTI  and  FTI  sensor  modes.  The  expected  benefit  evaluation  accounts  for  the  dwell-time  of 
each  mode.  The  algorithms  address  the  combinatorial  complexity  of  the  problem  through  the  use 
of  a  rollout  approach,  which  is  described  in  the  next  section. 

4.2  BASIC  PRINCIPLES  OF  ROLLOUT  ALGORITHMS 

A rollout algorithm is a dynamic programming approach that evaluates an action by estimating
the near-future and far-future benefits resulting from the current state and the action choice. The
near-future benefits are computed by predicting the action’s consequences over the look-ahead
planning stages. The far-future benefits are the benefits accumulated after the look-ahead
period. In the rollout approach, the far-future benefits are computed as the benefits resulting
from applying a fixed policy. Figure 2 illustrates the rollout approach.


Figure 2: A rollout approach to evaluate near and far future benefits of an action taken at the current state

Now we introduce some notation and formally describe the rollout approach, starting with the
optimality principle of dynamic programming theory. In particular, for a discrete-time
system, the optimality principle states that every optimal policy $\pi^*$ satisfies the following
relation:

$$\pi^*(S) = \arg\max_{u \in U(S)} \left\{ R(S,u) + \alpha \, E_w\!\left[ J^*(f(S,u,w)) \right] \right\} \quad \text{for all states } S, \tag{1}$$

where

- $S$ is the state of the system

- $U(S)$ is the set of controls available in state $S$

- $R(S,u)$ is the instantaneous reward received at state $S$ for selecting control $u$

- $\alpha$ is a discount factor satisfying $\alpha \in (0,1)$

- $w$ is a random outcome of control $u$

- $J^*$ is the optimal reward (the reward collected under policy $\pi^*$)

- $f(S,u,w)$ is the function specifying the new state to which the system transitions from
state $S$ under control $u$ and the control outcome $w$.

The rollout approach generates a policy $\bar{\pi}$ by replacing the term $J^*(f(S,u,w))$ in Equation (1)
with a reward $J_\pi$ collected from state $f(S,u,w)$ under some policy $\pi$. The resulting policy $\bar{\pi}$ is
a one-step look-ahead policy satisfying the following relation:

$$\bar{\pi}(S) = \arg\max_{u \in U(S)} \left\{ R(S,u) + \alpha \, E_w\!\left[ J_\pi(f(S,u,w)) \right] \right\} \quad \text{for all states } S, \tag{2}$$

where $\pi$ is some policy whose reward $J_\pi$ can be efficiently computed. This relation defines the
rollout approach that we use to solve our SRM problem. We consider two different
implementations, one maximizing precision and one minimizing error, as discussed in the
following section.
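
The one-step look-ahead in Equation (2) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function `rollout_action`, the callables passed to it, and the two-control toy problem are hypothetical names and numbers that do not appear in the report, and the outcome distribution is assumed to be a short finite list so that the expectation is an explicit sum.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

State = Hashable
Control = Hashable


def rollout_action(
    state: State,
    controls: Callable[[State], Iterable[Control]],
    reward: Callable[[State, Control], float],
    outcomes: Callable[[State, Control], List[Tuple[float, State]]],
    base_value: Callable[[State], float],
    alpha: float,
) -> Tuple[Control, float]:
    """One-step look-ahead in the spirit of Equation (2).

    outcomes(state, u) lists (probability, next_state) pairs, i.e. the
    distribution of f(S, u, w); base_value(s) plays the role of J_pi(s).
    """
    best_u, best_q = None, float("-inf")
    for u in controls(state):
        # Expectation over the random outcome w of control u.
        expected_future = sum(p * base_value(s2) for p, s2 in outcomes(state, u))
        q = reward(state, u) + alpha * expected_future
        if q > best_q:
            best_u, best_q = u, q
    return best_u, best_q


# Hypothetical toy problem (all numbers invented for illustration).
def toy_controls(state):
    return ["quick", "slow"]


def toy_reward(state, u):
    return 1.0 if u == "quick" else 0.0


def toy_outcomes(state, u):
    if u == "quick":
        return [(1.0, 0)]
    return [(0.5, 2), (0.5, 0)]


def toy_base_value(state):
    return 10.0 * state


choice, value = rollout_action(
    0, toy_controls, toy_reward, toy_outcomes, toy_base_value, alpha=0.9
)
```

In this toy case the control with the smaller immediate reward is selected because its expected base-policy reward is larger, which is the farsighted behavior the rollout relation is meant to capture.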

4.3  SRM  ALGORITHMS  FOR  MOVE-STOP  TRACKING 

4.3.1  Precision  Maximizing  Algorithm 

Here  we  formulate  the  SRM  problem  for  move-stop  tracking  as  a  dynamic  programming 
problem,  and  we  present  a  farsighted  SRM  algorithm  for  solving  it.  Initial  development  of  this 
algorithm was performed under an SBIR program [4]. That work assumes the dwell times of all
modes  are  the  same  duration.  The  extension  considered  here  addresses  issues  associated  with 
the  modes  having  different  durations. 

4.3.1.1 Formulation

We model the SRM problem for move-stop tracking as an infinite-horizon, continuous-time
stochastic  dynamic  programming  problem.  The  system  to  be  controlled  is  the  tracker.  The  state 
in  the  dynamic  program  consists  of  the  target  track  states.  Here,  a  control  choice  is  specified  by 
a  target  at  which  to  look  and  the  sensor  mode  to  use.  At  any  time  and  any  state,  the  available 
control  options  are  to  look  at  any  of  the  targets  currently  in  track  and  to  use  one  of  the  two  sensor 
modes.  A  sensor  measurement  is  a  random  outcome  of  the  control  choice  and  affects  the  future 
evolution  of  the  tracker  state. 

For a reward, we chose a function that rewards states with sufficiently high precision (i.e., small
error). In particular, the total expected reward accumulated has the following form:

$$E\left[ \int_0^{\infty} e^{-\gamma t} \sum_{i=1}^{n} V_i \, R_i(S_i(t)) \, dt \right], \tag{3}$$

where

• $\gamma$ is a discount factor specifying the rate at which the future controls contribute to the
total reward,

• $n$ is the number of tracks currently in the tracker,

• $V_i$ is a priority value of target $i$,

• $R_i(\cdot)$ is the instantaneous reward (discussed below),

• $S_i(t)$ is the state of track $i$ (the estimated error variance of track $i$ at time $t$),

• $u(t)$ is the control selected at tracker state $(S_1(t),\ldots,S_n(t))$; the track states in
Equation (3) evolve under these controls.

The reward $R_i$ is given by

$$R_i(S_i) = \begin{cases} 1 & \text{if } S_i \le G_i, \\ 0 & \text{if } S_i > G_i, \end{cases} \tag{4}$$

where $G_i$ is the goal state for track $i$ (desired error variance for track $i$).


4.3.1.2 Precision Maximizing SRM


The  sensor  management  problem  is  to  find  a  sequence  of  controls  maximizing  the  total  expected 
reward  shown  in  Equation  (3).  In  what  follows,  we  describe  an  SRM  algorithm  that  generates  an 
approximate  solution.  The  algorithm  relies  on  two  different  approximations,  model 
approximation and optimal-policy approximation. The first type of approximation involves
using  a  prediction  model  to  approximate  the  track  states.  The  second  type  of  approximation 
involves  use  of  a  near-optimal  policy  instead  of  the  optimal  scheduling  policy  for  the 
approximate  model  of  the  tracking  system.  We  subsequently  discuss  these  two  approximations  in 
detail. 


Prediction  Model 

The  architecture  of  the  SRM  system  is  illustrated  in  Figure  3.  The  SRM  evaluates  the  sensor 
actions, in terms of the objective function, Equation (3), by measuring current and future benefits
resulting  from  an  action  selected  at  the  current  time.  The  future  benefits  of  an  action  are 
computed  using  the  SRM  prediction  model,  which  predicts  the  future  target  behavior. 

Figure 3: SRM Architecture


From the SRM point of view, a target is characterized by a collection of attributes, including
the target mode probability distribution and the target kinematic state, which consists of the position error
covariance. To accommodate efficient computation of the expected policy reward, our sensor
resource manager uses a predictive model that approximates the tracker. This model is based
on the following:

Assumptions:

1. Each target is either moving or stopped, but the target motion state is unknown.

2. A target track is dropped if the target is not detected.

Assumption 1 is realistic for cases where the changes in target motion take longer than planning
and executing a sensor action. Assumption 2 is more conservative than necessary (a target track
may continue even with one or more missed detections). However, the resulting model is useful
for planning purposes. Furthermore, these assumptions restrict the branching of the control-outcome
space of any policy. This allows us to evaluate our farsighted policy without using
costly Monte Carlo simulations.

Under these assumptions, the probability distribution of the outcomes of any finite sequence of
control actions can be computed. We exploit this in our subsequent algorithm development.

The outcome of a single sensor action is either a detection or a missed detection, which results in
updating the target states either with or without a measurement. For each of these outcomes, we
give the target mode and kinematic state update equations.

Update with a Measurement

In this case, the target state is predicted to the next update time and updated with a measurement.
The state prediction equation is

$$S_i(t+T) = S_i(t) + T \left( p_{i1}(t)\, Q_{i1} + p_{i2}(t)\, Q_{i2} \right), \tag{5}$$

where

- $T$ is the time increment,

- $S_i(t)$ is the kinematic state of target $i$ at the current time $t$,

- $S_i(t+T)$ is the predicted state of target $i$ (at the future time $t+T$),

- $(p_{i1}(t), p_{i2}(t))$ is the current probability distribution for the mode of target $i$, with $p_{i1}$ and $p_{i2}$
being the probabilities of moving and stopped, respectively,

- $(Q_{i1}, Q_{i2})$ is the process noise covariance vector of the IMM filter, with $Q_{i1}$ and $Q_{i2}$ being
the process covariances of the filters modeling, respectively, the moving and stopped modes
for target $i$.

When the target is detected at time $t+T$ using sensor mode $j$, the target state is updated as follows:

$$S_i(t+T) = \frac{S_i(t+T)\, r_{ij}}{S_i(t+T) + r_{ij}}, \tag{6}$$

where $S_i(t+T)$ on the right-hand side is the predicted state from Equation (5) and $r_{ij}$ is the
measurement noise covariance for target $i$ and sensor mode $j$.

The target mode probability distribution at the update time is

$$p_{ij}(t+T) = 1, \quad \text{and} \quad p_{im}(t+T) = 0 \ \text{for modes } m \ne j. \tag{7}$$

Update  without  a  Measurement 

In  this  case,  the  target  state  is  predicted  to  the  next  update  time  according  to  Equation  (5).  Since 
the  target  is  unobserved,  the  target  mode  probability  distribution  does  not  change  according  to 
Assumption  1. 
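
The two update cases of the prediction model can be illustrated with a short Python sketch for a scalar variance. It is a minimal sketch, not the report's implementation: the function names are hypothetical, the stopped-mode probability is assumed to equal one minus the moving-mode probability, and the sensor mode is encoded by the illustrative strings "MTI" and "FTI".

```python
def predict_variance(s: float, dt: float, p_move: float,
                     q_move: float, q_stop: float) -> float:
    """Prediction step: grow the variance by the mode-weighted process noise.

    Assumes p_stop = 1 - p_move (two-mode distribution).
    """
    return s + dt * (p_move * q_move + (1.0 - p_move) * q_stop)


def update_variance(s_pred: float, r: float) -> float:
    """Measurement update of a scalar variance with measurement noise r."""
    return s_pred * r / (s_pred + r)


def step(s: float, p_move: float, dt: float, q_move: float, q_stop: float,
         detected: bool, sensor_mode: str, r: float):
    """Advance one target by dt and return (variance, p_move)."""
    s_pred = predict_variance(s, dt, p_move, q_move, q_stop)
    if detected:
        # A detection pins the mode distribution to the sensor mode used.
        p_new = 1.0 if sensor_mode == "MTI" else 0.0
        return update_variance(s_pred, r), p_new
    # Missed detection: prediction only, mode distribution unchanged.
    return s_pred, p_move


with_detection = step(4.0, 0.5, 2.0, 1.0, 0.0, True, "MTI", 5.0)
without_detection = step(4.0, 0.5, 2.0, 1.0, 0.0, False, "MTI", 5.0)
```

Because each sensor action has only these two outcomes, a sequence of actions produces a small outcome tree whose probabilities can be enumerated directly, which is what makes the policy evaluation tractable without Monte Carlo simulation.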


Algorithm 

As a first step toward maximizing the objective function in Equation (3), we replace the
continuous-time optimization problem with a discrete-time approximation. In particular, we
discretize time by letting

$$t_k = k\delta \quad \text{for } \delta > 0 \text{ and } k = 0, 1, 2, \ldots \tag{8}$$

We then approximate the continuous-time objective function in Equation (3) with a piecewise
constant function, resulting in the following discrete-time objective function:

$$E\left[ \sum_{k=0}^{\infty} \alpha^{k} \sum_{i=1}^{n} V_i \, R_i(S_i(t_k)) \right], \tag{9}$$

where $\alpha \in (0,1)$ is a discount factor given by $\alpha = e^{-\gamma\delta}$. For this objective function, we develop a
near-optimal policy $\bar{\pi}$ based on the rollout approach. In particular, for our SRM problem with
variable-duration dwell times, the rollout relation in Equation (2) takes the following form:

$$\bar{\pi}(S) = \arg\max_{j,m} \left\{ R(S,j,m,\Delta_m) + \alpha^{K(m)} \, E_{w_{jm}}\!\left[ J_\pi(f(S,j,m,\Delta_m,w_{jm})) \right] \right\} \quad \text{for all states } S, \tag{10}$$

where maximization takes place over all target tracks $j$ and sensor modes $m$, and

- $\Delta_m$ is the dwell time for sensor mode $m$,

- $R(S,j,m,\Delta_m)$ is the reward collected from state $S$ under control $u=(j,m)$ during time
interval $\Delta_m$,

- $K(m)$ is the dwell time in units of $\delta$ for mode $m$,

- $J_\pi$ is the expected reward accumulated under some policy $\pi$,

- $f(S,j,m,\Delta_m,w_{jm})$ is the state of the tracker at the end of the time interval $\Delta_m$ when the
measurement is made,

- $w_{jm}$ is a random variable taking values 1 or 0 that indicate whether or not target $j$ is detected
and the detection is correctly associated for sensor mode $m$.

Since the MTI dwell time is typically less than the FTI dwell time, one can use the MTI dwell
time as a unit of time and express the FTI dwell time as a multiple $t_{MTI}$ of it. By viewing $\Delta_{MTI}$ as
the time unit $\delta$, we have the following values for the dwell times $K(m)$ in Equation (10):

$$K(MTI) = 1 \quad \text{and} \quad K(FTI) = t_{MTI}. \tag{11}$$

The mode rewards $R(S,j,m,\Delta_m)$ for target $j$ have the following form:

$$R(S,j,MTI,\Delta_{MTI}) = \sum_{i=1}^{n} V_i \, R_i(S_i), \tag{12}$$

$$R(S,j,FTI,\Delta_{FTI}) = \sum_{t=0}^{K(FTI)-1} \alpha^{t} \sum_{i=1}^{n} V_i \, R_i(S_i(t)), \tag{13}$$

where $S=(S_1,\ldots,S_n)$ is the current tracker state and $S(t)$ is its state $t$ units later [$S(0) = S$], $S_i$ is the
current state of track $i$ and $S_i(t)$ is its state $t$ units later, while the reward $R_i$ is given by Equation (4).

We next describe the evaluation of the term

$$E_{w_{jm}}\!\left[ J_\pi(f(S,j,m,\Delta_m,w_{jm})) \right] \tag{14}$$

on the right-hand side of the rollout Equation (10). The expectation is with respect to the two
possible outcomes $w_{jm}=1$ and $w_{jm}=0$, indicating a detection and a missed detection, respectively,
of target $j$ with sensor mode $m$. Hence, we have

$$E_{w_{jm}}\!\left[ J_\pi(f(S,j,m,\Delta_m,w_{jm})) \right] = J_\pi(f(S,j,m,\Delta_m,1)) \, \mathrm{Prob}\{w_{jm}=1\} + J_\pi(f(S,j,m,\Delta_m,0)) \, \mathrm{Prob}\{w_{jm}=0\}. \tag{15}$$

The probabilities of the outcomes $w_{jm}=1$ and $w_{jm}=0$ are given by

$$\mathrm{Prob}\{w_{jm}=1\} = p_{jm}\beta_{jm}, \quad \mathrm{Prob}\{w_{jm}=0\} = 1 - p_{jm}\beta_{jm}, \tag{16}$$

where

- $p_{jm}$ is the probability that the target mode is $m$ (matching the sensor mode),

- $\beta_{jm}$ is the detection probability of sensor mode $m$ for target $j$.

The target mode probabilities $p_{jm}$ are computed using the mode likelihoods generated by the
IMM filter.
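
As an illustration of how the two detection outcomes enter the rollout score, the following Python sketch combines the outcome probabilities with the dwell-dependent discount. The function names and the numeric values are hypothetical, and the two base-policy rewards (one for a detection, one for a miss) are assumed to have been computed elsewhere.

```python
def expected_policy_reward(p_mode: float, beta: float,
                           j_detect: float, j_miss: float) -> float:
    """Expectation of the base-policy reward over the two outcomes.

    p_mode : probability that the target is in the mode the sensor looks for
    beta   : detection probability of that sensor mode for the target
    """
    p_detect = p_mode * beta  # probability of a (correctly associated) detection
    return p_detect * j_detect + (1.0 - p_detect) * j_miss


def rollout_score(mode_reward: float, alpha: float, dwell_units: int,
                  p_mode: float, beta: float,
                  j_detect: float, j_miss: float) -> float:
    """Score of one (target, mode) pair: reward during the dwell plus the
    discounted expected base-policy reward after the dwell."""
    future = expected_policy_reward(p_mode, beta, j_detect, j_miss)
    return mode_reward + alpha ** dwell_units * future


# Illustrative numbers only.
expected = expected_policy_reward(0.8, 0.5, 10.0, 2.0)
score = rollout_score(1.0, 0.5, 2, 0.8, 0.5, 10.0, 2.0)
```

The exponent `dwell_units` is where the different dwell durations of the two modes enter: a long dwell discounts the future reward more heavily than a short one.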


We now discuss the choice of policy $\pi$. Motivated by the desire to have a good policy whose
expected reward can be computed for any initial state, we consider a policy $\pi$ having the
following properties:

1. A target is observed with either MTI or FTI mode at all times.

2. Initially, the targets are sorted in a list according to some criterion. Then, these targets are
observed according to the list as follows: each target is observed until either its track error
decreases below the desired value or its track is dropped. If the track error is decreased below
the desired value, the target is revisited at the rate that keeps its track error below the desired
value.

We assume that the sensor can revisit the targets at rates that keep the track errors below
the desired values.

Under the policy $\pi$, it is assumed that the sensor uses one and the same mode when observing a
target. The sensor mode $m(i)$ to be used for observing target $i$ is determined as the most likely
target mode in the current target mode probability distribution $p_i = (p_{i1}, p_{i2})$, i.e.,

$$m(i) = \arg\max \{ p_{i1}, p_{i2} \}. \tag{17}$$

Given the current tracker state $S = (S_1,\ldots,S_n)$, policy $\pi$ sorts the tracks according to their vicinity
to the goal state, i.e., the tracks are sorted according to the values $\{S_i/G_i \mid i = 1,2,\ldots,n\}$. The
order is motivated by that generated according to an index rule policy such as that discussed in
[4]. The targets are then considered in that order, and to each target track the following rule is
applied:

If the target track state $S_i$ is outside the track accuracy goal region $\{s \mid s \le G_i\}$, the target is
consecutively observed, with the sensor mode matching the target mode, until the time its state
$S_i(t)$ enters the track accuracy goal region. After that time, the target is periodically revisited
with the smallest revisit rate that guarantees the state will remain within the goal region.
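
The ordering and mode-selection steps of the base policy can be sketched as follows. This is a hedged illustration only: the function name `base_policy_plan` and the output format are hypothetical, the sort is assumed to be ascending in the ratio of variance to goal (so that tracks already inside their goal regions come first), and ties in the mode probabilities are arbitrarily broken in favor of MTI.

```python
from typing import List, Sequence, Tuple


def base_policy_plan(
    variances: Sequence[float],
    goals: Sequence[float],
    mode_probs: Sequence[Tuple[float, float]],
) -> List[Tuple[int, str, str]]:
    """Return (track index, sensor mode, phase) in the order the tracks are served.

    mode_probs[i] = (probability moving, probability stopped) for track i.
    phase is "revisit" for tracks inside their goal region, else "observe".
    """
    # Assumed ordering: ascending ratio of current variance to goal variance.
    order = sorted(range(len(variances)), key=lambda i: variances[i] / goals[i])
    plan = []
    for i in order:
        p_move, p_stop = mode_probs[i]
        # Most likely target mode decides the sensor mode (tie -> MTI, an assumption).
        mode = "MTI" if p_move >= p_stop else "FTI"
        phase = "revisit" if variances[i] <= goals[i] else "observe"
        plan.append((i, mode, phase))
    return plan


plan = base_policy_plan(
    variances=[8.0, 1.0, 3.0],
    goals=[4.0, 2.0, 3.0],
    mode_probs=[(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)],
)
```

Fixing the order and the mode per target in advance is what keeps the reward of this policy computable in closed form for any starting state.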

It is computationally prohibitive to exactly evaluate the policy reward $J_\pi$ due to unpredictable
target mode changes in the future and due to the explosion of possibilities of observation
outcomes over a long period of time. We approximately evaluate the policy reward $J_\pi$ using the
predictive model for the evolution of the target mode probability distribution and kinematic state,
as described earlier.

The policy reward $J_\pi$ has the following form:

$$J_\pi(S) = \sum_{i=1}^{n} J_i(S). \tag{18}$$

For notational convenience, assume that the order of the targets is $1, 2, \ldots, n$ when the targets are
sorted according to the values $\{S_i/G_i \mid i = 1,2,\ldots,n\}$. Then, for each target, we determine the target
mode and the associated mode probability. The mode probabilities are derived from the mode
likelihoods generated by the IMM filter, and the target mode is defined as the most likely of the
modes.

Suppose that the states of the first $\tau$ targets are within their goal regions, i.e.,

$$S_i \le G_i \quad \text{for } i = 1, 2, \ldots, \tau. \tag{19}$$

For these targets, we have

$$J_i(S) = L_i, \quad i = 1, 2, \ldots, \tau, \tag{20}$$

where $L_i$ is the long-term reward accumulated during revisits (to be discussed shortly).

Consider now the targets $\tau+1,\ldots,n$, which are outside their corresponding goal regions. Suppose
that $T_1$ is the observation time required for state $S_{\tau+1}$ to enter the goal region $\{s \mid s \le G_{\tau+1}\}$. We
then have

$$J_{\tau+1}(S) = \left( \alpha \, \beta_{\tau+1,m(\tau+1)} \right)^{T_1} \cdot L_{\tau+1}, \tag{21}$$

where $m(\tau+1)$ is the sensor mode that is used for observing target $\tau+1$. While observing target
$\tau+1$, the states of the remaining targets $\tau+2,\ldots,n$ have evolved to $S_{\tau+2}(T_1),\ldots,S_n(T_1)$. Suppose
now that $T_2$ is the observation time required for state $S_{\tau+2}(T_1)$ to enter the goal region
$\{s \mid s \le G_{\tau+2}\}$. Then, we have

$$J_{\tau+2}(S) = \left( \alpha \, \beta_{\tau+1,m(\tau+1)} \right)^{T_1} \left( \alpha \, \beta_{\tau+2,m(\tau+2)} \right)^{T_2} \cdot L_{\tau+2}. \tag{22}$$

Continuing in this manner, we can see that

$$J_{\tau+j}(S) = \left( \alpha \, \beta_{\tau+1,m(\tau+1)} \right)^{T_1} \cdots \left( \alpha \, \beta_{\tau+j,m(\tau+j)} \right)^{T_j} \cdot L_{\tau+j} \quad \text{for } j = 1,\ldots,n-\tau, \tag{23}$$

where $T_j$ is the time for variance $S_{\tau+j}(T_1 + \ldots + T_{j-1})$ to enter the goal region $\{s \mid s \le G_{\tau+j}\}$.

We now discuss the long-term rewards $L_i$ accumulated during periodic revisits of the targets. As
we mentioned earlier, once the states of all targets enter their corresponding goal regions, the
targets are revisited at a constant rate. Under the assumption that the target-mode probability
distribution does not change, once a stopped target state is within its corresponding goal region,
it remains there for the rest of the time. Therefore, the stopped targets are not revisited, and the
long-term reward $L_i$ associated with a stopped target $i$ is computed as follows:

$$L_i = \sum_{t=0}^{\infty} \alpha^{t} V_i = \frac{V_i}{1-\alpha}. \tag{24}$$

We next discuss the long-term reward $L_i$ associated with a moving target $i$. Let $M$ be the length
of the revisit interval required for keeping the state of target $i$ within the goal region. Without
loss of generality, we may assume that the sensor revisits target $i$ at times $t = jM$, $j = 0, 1, \ldots$. Note
that, in view of our assumptions, the revisit process continues for as long as the target is
detected, and the process stops when the target is lost as a result of a missed detection.
Furthermore, the reward $V_i$ is collected while the target is still tracked, and the reward ceases
when the target is lost.

During the revisit stage, the target lifetime is a random variable taking value $jM$ with
probability $\beta_{i,m(i)}^{\,j-1} (1 - \beta_{i,m(i)})$ for $j = 1, 2, \ldots$. Let $\rho$ be the reward accumulated between any two
consecutive revisits. When the lifetime takes value $jM$, the lifetime reward $Rew(jM)$ is given by

$$Rew(jM) = \rho + \alpha^{M}\rho + \ldots + \alpha^{(j-1)M}\rho = \rho \, \frac{1-\alpha^{jM}}{1-\alpha^{M}}. \tag{25}$$

The long-term reward $L_i$ is equal to the expected lifetime reward during the revisit period, and it
can be seen that

$$L_i = \frac{\rho}{1 - \alpha^{M}\beta_{i,m(i)}}. \tag{26}$$

Since $\rho$ is the reward accumulated between two consecutive revisits, we have

$$\rho = V_i \left( 1 + \alpha + \ldots + \alpha^{M-1} \right) = V_i \, \frac{1-\alpha^{M}}{1-\alpha}. \tag{27}$$

By substituting this $\rho$ in Equation (26), we see that the long-term reward for a moving target $i$ is
given by

$$L_i = V_i \, \frac{1-\alpha^{M}}{(1-\alpha)\left( 1 - \alpha^{M}\beta_{i,m(i)} \right)}. \tag{28}$$


We note here that the preceding precision maximizing SRM algorithm extends to the
multidimensional case by replacing the variance $S_i$ with the trace $Tr(S_i)$ of the covariance $S_i$.
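
The long-term rewards lend themselves to a quick numerical sanity check. The Python sketch below is an illustration under explicit assumptions, not a result from the report: it assumes that the lifetime of a moving target equals j revisit intervals with geometric probability beta^(j-1) * (1 - beta), that the per-interval reward is V * (1 - alpha^M) / (1 - alpha), and it compares the resulting closed form with a truncated version of the expectation. All function names and numbers are hypothetical.

```python
def stopped_reward(v: float, alpha: float) -> float:
    """Discounted reward of a stopped target that stays in its goal region."""
    return v / (1.0 - alpha)


def moving_reward(v: float, alpha: float, m: int, beta: float) -> float:
    """Closed form: per-interval reward divided by (1 - alpha^M * beta)."""
    rho = v * (1.0 - alpha ** m) / (1.0 - alpha)
    return rho / (1.0 - alpha ** m * beta)


def moving_reward_series(v: float, alpha: float, m: int, beta: float,
                         terms: int = 2000) -> float:
    """Truncated expectation of the lifetime reward over the assumed
    geometric lifetime distribution."""
    rho = v * (1.0 - alpha ** m) / (1.0 - alpha)
    total = 0.0
    for j in range(1, terms + 1):
        prob = beta ** (j - 1) * (1.0 - beta)
        lifetime_reward = rho * (1.0 - alpha ** (j * m)) / (1.0 - alpha ** m)
        total += prob * lifetime_reward
    return total


closed_form = moving_reward(2.0, 0.9, 3, 0.8)
series_value = moving_reward_series(2.0, 0.9, 3, 0.8)
```

Under these assumptions the two values agree to numerical precision, and setting the detection probability to one recovers the stopped-target reward, since a target that is never lost collects its priority value forever.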


4.3.2  Error  Minimizing  Algorithm 


Here we present an error minimizing SRM algorithm as applied to move-stop tracking. In the
dynamic programming formulation of the SRM problem for move-stop tracking (Section
4.3.1.1), we use an instantaneous cost instead of a reward. In particular, a target incurs a nonzero
cost if the target error variance exceeds a specified error variance goal. The target cost is
instantaneous, and it is proportional to the difference between the target error and the error goal.
Formally, the instantaneous cost at time $t$ for target $i$ is given by

$$C_i(S_i(t)) = V_i \cdot \max\{ S_i(t) - G_i, \ 0 \}, \tag{29}$$

where $S_i(t)$ is the target error variance, $V_i$ is the target priority, and $G_i$ is the error goal for target $i$.
The cost of the composite target state $S(t) = (S_1(t),\ldots,S_n(t))$ is additive over the targets, i.e.,

$$C(S(t)) = \sum_{i=1}^{n} C_i(S_i(t)). \tag{30}$$

4.3.2.1 Error Minimizing SRM


For a continuous-time system, the rollout relation in Equation (2) takes the following form:

$$\bar{\pi}(S) = \arg\min_{u \in U(S)} \left\{ \int_0^{t_u} C(S(t)) \, e^{-\gamma t} \, dt + e^{-\gamma t_u} \, E_w\!\left[ J_\pi(f(S(t_u),u,w)) \right] \right\} \quad \text{for all states } S, \tag{31}$$

where

- $S$ is the state of the targets at the current time $t=0$, i.e., $S=S(0)$,

- $\pi$ is some fixed policy,

- $U(S)$ is the set of controls available in state $S$,

- $t_u$ is the latency time associated with control $u$ (for a sensor, $t_u$ is the dwell time),

- $\gamma$ is an exponential decay factor satisfying $\gamma > 0$,

- $w$ is a random outcome of the control $u$,

- $f(S,u,w)$ is the function specifying the new state to which the system transitions from state
$S$ under control $u$ and the control outcome $w$.

We assume that the control options are target-mode pairs, with the set of candidate modes being
MTI, FTI, and “idle” (no target selected). We also assume that the “idle” mode has infinite
dwell time, so that once the idle mode is selected no other mode can ever be used in the future.
We consider a rollout where $\pi$ is the “idle” policy, i.e., the policy selects the “idle” mode at any
state. For such a policy $\pi$ and the additive cost, the minimization on the right-hand side of
Equation (31) reduces to:


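$$\min_{u \in U(S)} \ \sum_{i=1}^{n} \underbrace{\left\{ \underbrace{\int_0^{t_u} C_i(S_i(t)) \, e^{-\gamma t} \, dt}_{\text{transitional cost}} + \underbrace{E_w\!\left[ \int_{t_u}^{\infty} C_i(f(S_i(t_u),u,w)) \, e^{-\gamma t} \, dt \right]}_{\text{future cost}} \right\}}_{\text{target cost}}, \tag{32}$$

where the set of controls $U(S)$ is the same for all states $S$, i.e., $U(S)=U$ for all $S$ with

$$U = \{ (i,m) \mid i \in \{1,\ldots,n\}, \ m \in \{MTI, FTI, idle\} \}. \tag{33}$$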


The transitional cost is incurred from the current time until the execution time $t_u$ of control $u$.
The future cost is incurred for the rest of the time, starting from the time immediately after the
control $u$ is executed. Computing each of these costs requires predicting the evolution of target
states. We approximate the state evolution of the IMM tracker by using the predictive model
described in Section 4.3.1.2. This simple model estimates the target behavior well and simplifies
the integral computations in Equation (32).


Algorithm 

We now focus on the minimization problem given in Equation (32). First, we evaluate the
target costs in Equation (32) for a given control $u=(j,\mu)$ with $\mu \in \{MTI, FTI\}$. For target $j$, we
consider the cost for the cases when the target is observed and when it is unobserved. Without loss of
generality, we may assume that the current time is $t=0$, so that the current target state is $S_j(0)$.

Cost for Observed Target

During the observation, the target state evolves as follows:

$$S_j(t) = S_j(0) + t \cdot (p_{j1}Q_{j1} + p_{j2}Q_{j2}) \quad \text{for } t \in [0,t_u), \tag{34}$$

where $p_{j1}$ and $p_{j2}$ are the probabilities for the target motion modes, with 1 corresponding to a moving
target and 2 corresponding to a stopped target. From the preceding relation and the definition of
the cost [cf. Equation (29)], it follows that

$$C_j(S_j(t)) = V_j \cdot \max\{ S_j(0) - G_j + t \cdot (p_{j1}Q_{j1} + p_{j2}Q_{j2}), \ 0 \} \quad \text{for } t \in [0,t_u). \tag{35}$$

Define

$$T_j = \max\left\{ \frac{G_j - S_j(0)}{p_{j1}Q_{j1} + p_{j2}Q_{j2}}, \ 0 \right\}, \tag{36}$$

and note that $T_j$ is the time at which the target variance $S_j(t)$ exceeds the specified variance bound $G_j$. We
refer to $T_j$ as the crossing time from state $S_j(0)$. We note that this time depends on the initial variance
$S_j(0)$, i.e., $T_j = T_j(S_j(0))$, and it may be infinite.
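
The linear variance growth and the crossing time can be sketched in Python as follows. The function names are hypothetical, the mode-weighted growth rate is passed in as a single number, and the infinite crossing time is assumed to arise when that rate is zero while the variance is still below the goal.

```python
import math


def crossing_time(s0: float, g: float, rate: float) -> float:
    """Time until a variance starting at s0 and growing at `rate` exceeds g.

    Returns 0 if the goal is already exceeded and infinity if the variance
    never grows (an assumed reading of the "may be infinite" case).
    """
    if s0 >= g:
        return 0.0
    if rate <= 0.0:
        return math.inf
    return (g - s0) / rate


def predicted_cost(t: float, s0: float, g: float, v: float, rate: float) -> float:
    """Instantaneous cost at time t under linear variance growth."""
    return v * max(s0 - g + t * rate, 0.0)


t_cross = crossing_time(1.0, 4.0, 1.5)
cost_at_crossing = predicted_cost(t_cross, 1.0, 4.0, 2.0, 1.5)
cost_later = predicted_cost(4.0, 1.0, 4.0, 2.0, 1.5)
```

The cost is zero up to the crossing time and grows linearly afterwards, which is why the cost integrals split naturally at that time.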

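It can be seen that the transitional cost has the following form:

$$TranCost_j(S_j(0)) = \frac{V_j}{\gamma} \max\{ S_j(0) - G_j, \ 0 \} \cdot \left( 1 - e^{-\gamma t_u} \right) + \frac{V_j \,(p_{j1}Q_{j1} + p_{j2}Q_{j2})}{\gamma^2} \cdot \left[ (\gamma\theta + 1) e^{-\gamma\theta} - (\gamma t_u + 1) e^{-\gamma t_u} \right], \tag{37}$$

where

$$\theta = \min\{ t_u, \ T_j \}.$$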
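
The closed form of the transitional cost rests on two elementary integrals. As a reminder (these are standard calculus identities, stated here for convenience rather than taken from the report), for $\gamma > 0$ and $0 \le a \le b$, with $a$ and $b$ used as generic integration limits:

```latex
\int_a^b e^{-\gamma t}\,dt = \frac{e^{-\gamma a} - e^{-\gamma b}}{\gamma},
\qquad
\int_a^b t\,e^{-\gamma t}\,dt
  = \left[ -\frac{(\gamma t + 1)\,e^{-\gamma t}}{\gamma^{2}} \right]_a^b
  = \frac{(\gamma a + 1)\,e^{-\gamma a} - (\gamma b + 1)\,e^{-\gamma b}}{\gamma^{2}} .
```

Taking $a = 0$ and $b = t_u$ in the first identity gives the factor $(1 - e^{-\gamma t_u})/\gamma$, and taking $a = \theta$ and $b = t_u$ in the second gives the bracketed difference that multiplies the variance growth rate.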

We now focus on the expected future cost. Let $S_j(t_u,w)$ denote the target state immediately after
the observation time $t_u$, where $w=1$ or $w=0$ indicates that the state $S_j(t_u,w)$ results from the update
with or without a measurement, respectively. Since the target-mode probability distribution does
not change when the target is unobserved (cf. Assumption 1), the future cost from state $S_j(t_u,w)$
for $w=0$ is given by:

$$\mathrm{FutureCost}_j(S_j(t_u,0)) = V_j \int_{t_u}^{\infty} \max\{S_j(t_u,0) - G_j + t\,(p_{j1}Q_{j1} + p_{j2}Q_{j2}),\ 0\}\, e^{-\gamma t}\, dt. \tag{38}$$

However, we have $p_{j\mu}=1$ for the observed target. Therefore, the future cost from state $S_j(t_u,w)$ for
$w=1$ is given by:

$$\mathrm{FutureCost}_j(S_j(t_u,1)) = V_j \int_{t_u}^{\infty} \max\{S_j(t_u,1) - G_j + t\,p_{j\mu}Q_{j\mu},\ 0\}\, e^{-\gamma t}\, dt. \tag{39}$$

Let $T_j(w)$ be the crossing time for target $j$ starting from state $S_j(t_u,w)$, i.e.,

$$T_j(0) = \max\left\{\frac{G_j - S_j(t_u,0)}{p_{j1}Q_{j1} + p_{j2}Q_{j2}},\ 0\right\}, \tag{40}$$

$$T_j(1) = \max\left\{\frac{G_j - S_j(t_u,1)}{p_{j\mu}Q_{j\mu}},\ 0\right\}. \tag{41}$$

Then, it can be seen that the future cost accumulated from state $S_j(t_u,w)$ is given by

$$\mathrm{FutureCost}_j(S_j(t_u,w)) = \frac{V_j}{\gamma}\max\{S_j(t_u,w) - G_j,\ 0\}\, e^{-\gamma t_u} + \frac{V_j\,Q_j(w)}{\gamma^2}\,(\gamma\,\tau(w) + 1)\, e^{-\gamma\tau(w)}, \tag{42}$$

where

$$Q_j(w) = \begin{cases} p_{j1}Q_{j1} + p_{j2}Q_{j2} & \text{for } w = 0, \\ p_{j\mu}Q_{j\mu} & \text{for } w = 1, \end{cases} \tag{43}$$

$$\tau(w) = \max\{t_u,\ T_j(w)\} \quad \text{for } w = 0, 1. \tag{44}$$
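The closed form for the future cost stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the argument names, and the test values are hypothetical, `growth` plays the role of $Q_j(w)$, and the case of zero growth is handled separately so that an infinite crossing time does not produce an undefined product.

```python
import math


def future_cost(s, goal, growth, gamma, t_u, priority=1.0):
    """Closed-form future cost from state s = S_j(t_u, w).

    growth is Q_j(w), gamma is the discount rate, t_u the dwell time.
    """
    # Term driven by the excess of the variance over the goal at time t_u.
    offset = (priority / gamma) * max(s - goal, 0.0) * math.exp(-gamma * t_u)
    if growth <= 0.0:
        return offset  # no variance growth, so the growth term vanishes
    crossing = max((goal - s) / growth, 0.0)   # crossing time T_j(w)
    tau = max(t_u, crossing)                   # tau(w) = max{t_u, T_j(w)}
    ramp = (priority * growth / gamma ** 2
            * (gamma * tau + 1.0) * math.exp(-gamma * tau))
    return offset + ramp
```

For example, with $\gamma = 1$, $t_u = 1$, a variance one unit above the goal and a growth rate of 2, the two terms are $e^{-1}$ and $4e^{-1}$, which agrees with integrating $(1 + 2t)e^{-t}$ from 1 to infinity.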


The expected future cost is a weighted sum of the future costs corresponding to outcomes $w=1$
and $w=0$. Specifically, it is given by

$$\mathrm{ExpFutureCost}_j(S_j(t_u)) = \beta_{j\mu}\,p_{j\mu}\,\mathrm{FutureCost}_j(S_j(t_u,1)) + (1 - \beta_{j\mu}\,p_{j\mu})\,\mathrm{FutureCost}_j(S_j(t_u,0)), \tag{45}$$

where $\beta_{j\mu}$ is the probability of detecting target $j$ with sensor mode $\mu$.

Then, the total cost for target $j$ is

$$\mathrm{TotalCost}_j(S_j(0)) = \mathrm{TranCost}_j(S_j(0)) + \mathrm{FutureCost}_j(S_j(t_u,0)) + \beta_{j\mu}\,p_{j\mu}\left[\mathrm{FutureCost}_j(S_j(t_u,1)) - \mathrm{FutureCost}_j(S_j(t_u,0))\right]. \tag{46}$$

The first two terms represent the cost associated with the event of not observing the target. The
last term represents the expected cost reduction resulting from target detection. Note that the last
term is non-positive, i.e.,

$$\beta_{j\mu}\,p_{j\mu}\left[\mathrm{FutureCost}_j(S_j(t_u,1)) - \mathrm{FutureCost}_j(S_j(t_u,0))\right] \le 0. \tag{47}$$


Cost  for  Unobserved  Target 

For a target $i$ with $i \ne j$, there is no uncertainty in the future cost, so that the total cost is given
by

$$\mathrm{TotalCost}_i(S_i(0)) = \mathrm{TranCost}_i(S_i(0)) + \mathrm{FutureCost}_i(S_i(t_u,0)), \tag{48}$$

where the transient cost $\mathrm{TranCost}_i(S_i(0))$ and the future cost $\mathrm{FutureCost}_i(S_i(t_u,0))$ are given by
Equation (37) and Equation (38), respectively.

By summing the total costs of all targets, we obtain the cost associated with the state $S(0)$ and the
control choice $u=(j,\mu)$ for $\mu \in \{\mathrm{MTI},\mathrm{FTI}\}$. In particular, we have

$$\mathrm{Cost}(S(0),j,\mu) = \sum_{i=1}^{n}\left[\mathrm{TranCost}_i(S_i(0)) + \mathrm{FutureCost}_i(S_i(t_u,0))\right] + \beta_{j\mu}\,p_{j\mu}\left[\mathrm{FutureCost}_j(S_j(t_u,1)) - \mathrm{FutureCost}_j(S_j(t_u,0))\right]. \tag{49}$$

The cost associated with the control option $u=(j,\mathrm{idle})$ is

$$\mathrm{Cost}(S(0),j,\mathrm{idle}) = \sum_{i=1}^{n}\left[\mathrm{TranCost}_i(S_i(0)) + \mathrm{FutureCost}_i(S_i(t_u,0))\right]. \tag{50}$$

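In view of Equation (47), the cost of control $u=(j,\mu)$ for $\mu \in \{\mathrm{MTI},\mathrm{FTI}\}$ is no larger than the cost
of control $u=(j,\mathrm{idle})$ for a fixed target $j$, i.e.,

$$\begin{aligned}
\mathrm{Cost}(S(0),j,\mu) &= \sum_{i=1}^{n}\left[\mathrm{TranCost}_i(S_i(0)) + \mathrm{FutureCost}_i(S_i(t_u,0))\right] + \beta_{j\mu}\,p_{j\mu}\left[\mathrm{FutureCost}_j(S_j(t_u,1)) - \mathrm{FutureCost}_j(S_j(t_u,0))\right] \\
&\le \sum_{i=1}^{n}\left[\mathrm{TranCost}_i(S_i(0)) + \mathrm{FutureCost}_i(S_i(t_u,0))\right] \\
&= \mathrm{Cost}(S(0),j,\mathrm{idle}).
\end{aligned} \tag{51}$$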

Hence, the minimization in Equation (32) reduces to

$$\min_{\substack{j \in \{1,\ldots,n\} \\ \mu \in \{\mathrm{MTI},\mathrm{FTI}\}}} \mathrm{Cost}(S(0),j,\mu), \tag{52}$$

with the control cost $\mathrm{Cost}(S(0),j,\mu)$ as given in Equation (49).
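The selection step can be sketched as follows. Since the sum over all targets in Equation (49) is the same for every control, only the detection term changes from one candidate to the next. The helper names and the numbers below are hypothetical and serve only to illustrate the minimization; the report does not prescribe this implementation.

```python
def control_cost(base_cost, beta, p_mode, f_detect, f_miss):
    """Cost of a control (j, mode) in the form of Equation (49).

    base_cost is the sum over all targets of transient plus w=0 future cost,
    beta the detection probability, p_mode the probability that target j is
    in the motion mode seen by the sensor mode, and f_detect / f_miss the
    future costs of target j with and without a measurement.
    """
    return base_cost + beta * p_mode * (f_detect - f_miss)


def best_control(base_cost, candidates):
    """Return the control with minimum cost and that cost.

    candidates maps (j, mode) -> (beta, p_mode, f_detect, f_miss).
    """
    costs = {u: control_cost(base_cost, *v) for u, v in candidates.items()}
    u_best = min(costs, key=costs.get)
    return u_best, costs[u_best]


# Hypothetical values for two targets and two sensor modes.
candidates = {
    (1, "MTI"): (0.9, 0.8, 1.0, 3.0),
    (1, "FTI"): (0.9, 0.2, 0.5, 3.0),
    (2, "MTI"): (0.9, 0.5, 2.0, 2.5),
}
u_best, c_best = best_control(10.0, candidates)
```

In this example the control (1, "MTI") is selected because it yields the largest expected cost reduction.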

The preceding error minimizing SRM algorithm extends to the multidimensional case by
replacing the variance $S_i$ with the trace $\mathrm{Tr}(S_i)$ of the covariance $S_i$.

4.3.2.2 Myopic Entropy-Based SRM Algorithm

Here,  we  present  a  myopic  sensor  management  algorithm  that  serves  as  a  baseline  for  evaluating 
the  performance  of  the  farsighted  algorithms  discussed  in  the  preceding  sections.  We  do  not 
consider  a  myopic  approach  optimizing  the  dynamic  programming  formulation.  This  is  because 
the myopic property is not well defined for SRM problems in which different control actions
have significantly different execution times. In particular, this is the case with the SRM problem
for  move/stop  tracking,  where  the  system  state  transition  time  is  significantly  different  for 
different  controls  (specifically,  for  different  sensor  modes).  Thus  to  anticipate  benefits  at  the 
possible  future  states,  we  have  to  predict  into  the  future  over  significantly  different  time 
intervals,  which  is  not  a  property  of  a  myopic  approach. 

We consider an algorithm that evaluates sensor actions based on the expected decrease in the
entropy of the target-track errors per unit of time. The algorithm is myopic since the changes in
the entropy are computed only for a single sensor action. Specifically, let the current time be $t=0$
and let the current system state be $S=(S_1,\ldots,S_n)$. The entropy $h_i$ for target $i$ with variance $S_i$ and
the mode probabilities $(p_{i1}, p_{i2})$ is given by

$$\begin{aligned}
h_i(S_i) &= V_i\left(\tfrac{1}{2}\,p_{i1}\log[2\pi e S_i] + \tfrac{1}{2}\,p_{i2}\log[2\pi e S_i] - p_{i1}\log p_{i1} - p_{i2}\log p_{i2}\right) \\
&= \frac{V_i}{2}\log[2\pi e] + \frac{V_i}{2}\log S_i - V_i\,(p_{i1}\log p_{i1} + p_{i2}\log p_{i2}),
\end{aligned} \tag{53}$$

where $V_i$ is the priority of target $i$ and $\pi \approx 3.14$ (see [5], Chapter 9). As seen from this relation, the
target entropy is measured by the target-track error in log-scale. The entropy $H$ of the system at the
current time is defined as the sum of the current target entropies $h_i$:

$$H(S) = \sum_{i=1}^{n} h_i(S_i) \quad \text{where } S = (S_1,\ldots,S_n). \tag{54}$$
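A short Python sketch of the target and system entropies may help fix the notation. It uses natural logarithms and the convention $0 \log 0 = 0$, both of which are assumptions made here for the illustration; the function names are hypothetical.

```python
import math


def xlogx(p):
    """p * log(p) with the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def target_entropy(variance, priority, p_move, p_stop):
    """Priority-weighted entropy h_i of one target, as in Equation (53)."""
    gaussian = 0.5 * math.log(2.0 * math.pi * math.e * variance)
    mode = -(xlogx(p_move) + xlogx(p_stop))
    return priority * (gaussian + mode)


def system_entropy(targets):
    """System entropy H(S), the sum of the target entropies (Equation (54)).

    targets is a list of (variance, priority, p_move, p_stop) tuples.
    """
    return sum(target_entropy(*t) for t in targets)
```

For a unit-variance, unit-priority target whose mode is known, the entropy is $\tfrac{1}{2}\log(2\pi e) \approx 1.419$; with equal mode probabilities it increases by $\log 2$.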


An entropy-based score is associated with each control $u=(j,\mu)$. The score is equal to the
expected change in the entropy per unit of time, which is negative when the entropy is expected to decrease:

$$D(S,u) = \frac{E[H(S(t_u))] - H(S)}{t_u}, \tag{55}$$

where $S(t_u)$ is the state to which the system transitions under the control $u$.

At any decision time, the entropy-based SRM algorithm selects a control having the minimum
score, i.e., the sensor manager solves the following problem

$$\min_{u \in U} D(S,u), \qquad U = \{(j,\mu) \mid j \in \{1,\ldots,n\},\ \mu \in \{\mathrm{MTI},\mathrm{FTI}\}\}. \tag{56}$$

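We now derive the explicit form for $D(S,u)$. Under any control, the predicted target variances at
time $t_u$, denoted here by $S_i^-(t_u)$, are

$$S_i^-(t_u) = S_i + t_u\,(p_{i1}Q_{i1} + p_{i2}Q_{i2}). \tag{57}$$

Thus, the entropy at time $t_u$ for unobserved target $i$ is given by

$$h_i(S_i^-(t_u)) = \frac{V_i}{2}\log[2\pi e] + \frac{V_i}{2}\log S_i^-(t_u) - V_i\,(p_{i1}\log p_{i1} + p_{i2}\log p_{i2}). \tag{58}$$

Under a control $u=(j,\mu)$ and outcome $w=1$ (corresponding to target detection), the updated
variance of target $j$, denoted here by $S_j^+(t_u)$, is

$$S_j^+(t_u) = \frac{S_j^-(t_u)\, r_{j\mu}}{S_j^-(t_u) + r_{j\mu}}, \tag{59}$$

where $r_{j\mu}$ is the measurement error variance for target $j$ under sensor mode $\mu$.
We have $p_{j\mu}=1$ in this case, so that the entropy at time $t_u$ for observed target $j$ is given by

$$h_j(S_j^+(t_u)) = \frac{V_j}{2}\log[2\pi e] + \frac{V_j}{2}\log S_j^+(t_u). \tag{60}$$

For a given control $u=(j,\mu)$, the variances $S_i$ of targets $i$ with $i \ne j$ do not depend on the control
outcome $w$. Hence, the following holds for the expected entropy at time $t_u$:

$$\begin{aligned}
E[H(S(t_u))] &= \sum_{\substack{i=1 \\ i \ne j}}^{n} h_i(S_i^-(t_u)) + (1 - \beta_{j\mu}p_{j\mu})\,h_j(S_j^-(t_u)) + \beta_{j\mu}p_{j\mu}\,h_j(S_j^+(t_u)) \\
&= \sum_{i=1}^{n} h_i(S_i^-(t_u)) + \beta_{j\mu}p_{j\mu}\left(h_j(S_j^+(t_u)) - h_j(S_j^-(t_u))\right).
\end{aligned} \tag{61}$$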


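By substituting the expression for the target entropies $h_i$ [cf. Equations (58) and (60)], with $S_i^-(t_u)$ the predicted and $S_j^+(t_u)$ the updated variance, we obtain

$$\begin{aligned}
E[H(S(t_u))] ={}& \frac{1}{2}\log[2\pi e]\sum_{i=1}^{n}V_i + \frac{1}{2}\sum_{i=1}^{n}V_i\log S_i^-(t_u) - \sum_{i=1}^{n}V_i\,(p_{i1}\log p_{i1} + p_{i2}\log p_{i2}) \\
&+ \beta_{j\mu}p_{j\mu}V_j\left(\frac{1}{2}\log S_j^+(t_u) - \frac{1}{2}\log S_j^-(t_u) + p_{j1}\log p_{j1} + p_{j2}\log p_{j2}\right).
\end{aligned} \tag{62}$$

The expected entropy change is given by

$$\begin{aligned}
E[H(S(t_u))] - H(S) ={}& \frac{1}{2}\log[2\pi e]\sum_{i=1}^{n}V_i + \frac{1}{2}\sum_{i=1}^{n}V_i\log S_i^-(t_u) - \sum_{i=1}^{n}V_i\,(p_{i1}\log p_{i1} + p_{i2}\log p_{i2}) \\
&+ \beta_{j\mu}p_{j\mu}V_j\left(\frac{1}{2}\log S_j^+(t_u) - \frac{1}{2}\log S_j^-(t_u) + p_{j1}\log p_{j1} + p_{j2}\log p_{j2}\right) \\
&- \frac{1}{2}\log[2\pi e]\sum_{i=1}^{n}V_i - \frac{1}{2}\sum_{i=1}^{n}V_i\log S_i + \sum_{i=1}^{n}V_i\,(p_{i1}\log p_{i1} + p_{i2}\log p_{i2}),
\end{aligned} \tag{63}$$

which reduces to

$$\begin{aligned}
E[H(S(t_u))] - H(S) &= \frac{1}{2}\sum_{i=1}^{n}V_i\log S_i^-(t_u) - \frac{1}{2}\sum_{i=1}^{n}V_i\log S_i + \beta_{j\mu}p_{j\mu}V_j\left(\frac{1}{2}\log S_j^+(t_u) - \frac{1}{2}\log S_j^-(t_u) + p_{j1}\log p_{j1} + p_{j2}\log p_{j2}\right) \\
&= \frac{1}{2}\sum_{i=1}^{n}V_i\log\frac{S_i^-(t_u)}{S_i} + \beta_{j\mu}p_{j\mu}V_j\left(\frac{1}{2}\log\frac{S_j^+(t_u)}{S_j^-(t_u)} + p_{j1}\log p_{j1} + p_{j2}\log p_{j2}\right).
\end{aligned} \tag{64}$$

Thus, the myopic entropy-based SRM selects a control $u$ that minimizes the time averaged
changes in the entropy:

$$\frac{E[H(S(t_u))] - H(S)}{t_u} = \frac{1}{2t_u}\sum_{i=1}^{n}V_i\log\frac{S_i^-(t_u)}{S_i} + \frac{\beta_{j\mu}p_{j\mu}V_j}{t_u}\left(\frac{1}{2}\log\frac{S_j^+(t_u)}{S_j^-(t_u)} + p_{j1}\log p_{j1} + p_{j2}\log p_{j2}\right), \tag{65}$$

over all $u \in U$, where $U = \{(j,\mu) \mid j \in \{1,\ldots,n\},\ \mu \in \{\mathrm{MTI},\mathrm{FTI}\}\}$.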
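The time averaged entropy change can be evaluated directly, as in the following Python sketch. Two points are assumptions made for this illustration and are not confirmed by the text above: the updated variance is computed with the standard scalar Kalman update $S^- r/(S^- + r)$, and natural logarithms are used with $0 \log 0 = 0$. All names and numbers are hypothetical.

```python
import math


def xlogx(p):
    """p * log(p) with the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def entropy_score(targets, j, beta, p_mode, r, t_u):
    """Time averaged expected entropy change for observing target j.

    targets: list of (variance, priority, p_move, q_move, p_stop, q_stop).
    beta: detection probability, p_mode: probability that target j is in the
    motion mode seen by the chosen sensor mode, r: measurement error
    variance, t_u: dwell time of the chosen sensor mode.
    """
    # Entropy growth of every target from variance prediction over the dwell.
    drift = 0.0
    for s, v, p1, q1, p2, q2 in targets:
        s_pred = s + t_u * (p1 * q1 + p2 * q2)
        drift += 0.5 * v * math.log(s_pred / s)
    # Expected entropy reduction for the observed target j.
    s, v, p1, q1, p2, q2 = targets[j]
    s_pred = s + t_u * (p1 * q1 + p2 * q2)
    s_upd = s_pred * r / (s_pred + r)   # assumed scalar Kalman update
    gain = beta * p_mode * v * (0.5 * math.log(s_upd / s_pred)
                                + xlogx(p1) + xlogx(p2))
    return (drift + gain) / t_u
```

The sensor manager would evaluate this score for every pair (j, mode), using the dwell time and detection probability of that mode, and keep the control with the smallest value.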

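For the multidimensional case, the entropy of a target $i$ is given by

$$h_i(S_i) = \frac{N V_i}{2}\log[2\pi e] + \frac{V_i}{2}\log|S_i| - V_i\,(p_{i1}\log p_{i1} + p_{i2}\log p_{i2}), \tag{66}$$

where $S_i$ is the covariance matrix, $N$ is the size of the matrix $S_i$, and $|S_i|$ is the determinant of $S_i$.
In this case, the myopic SRM selects a control $u$ that minimizes (over all $u \in U$) the time
averaged changes in the entropy:

$$\frac{E[H(S(t_u))] - H(S)}{t_u} = \frac{1}{2t_u}\sum_{i=1}^{n}V_i\log\frac{|S_i^-(t_u)|}{|S_i|} + \frac{\beta_{j\mu}p_{j\mu}V_j}{t_u}\left(\frac{1}{2}\log\frac{|S_j^+(t_u)|}{|S_j^-(t_u)|} + p_{j1}\log p_{j1} + p_{j2}\log p_{j2}\right), \tag{67}$$

where $S_i^-(t_u)$ and $S_j^+(t_u)$ denote the predicted and the updated covariance matrices at time $t_u$, respectively.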


4.4  SIMULATION  RESULTS 

Here,  we  present  our  simulation  model  and  the  test  results  obtained  for  the  precision  maximizing 
algorithm,  the  error  minimizing  algorithm,  and  the  myopic  entropy-based  algorithm. 

4.4.1  Model  Parameters 

We  start  with  a  detailed  discussion  of  the  tracker  and  sensor  parameters,  and  values  identified  to 
be  reflective  of  realistic  scenarios. 

Tracker  Model  Parameters 

In  the  tracker,  the  target  dynamics  are  modeled  by  an  interacting  multiple  model  (IMM)  filter 
(see  [6])  consisting  of  two  filters:  one  modeling  the  kinematics  of  a  “moving”  target  and  the 
other  modeling  the  kinematics  of  a  “stopped”  target.  It  is  assumed  that,  when  moving,  targets 
move  along  a  road  (i.e.,  along  a  line)  with  a  random  velocity  normally  distributed  with  specified 
root  mean  square  value.  The  target  kinematic  state  consists  of  the  target  location  estimate  and 
the  estimate  of  error  variance.  These  states  are  estimated  from  position  reports  generated  by  a 
single  sensor. 

Perfect  report  association  is  assumed,  so  that  each  report  is  associated  with  a  single  target.  The 
tracker  model  drops  the  track  if  the  target  position  error  variance  exceeds  a  specified  maximum 
value.  A  new  track  is  immediately  initialized  based  on  the  target  truth. 

The target motion mode, moving or stopped, is modeled according to a discrete-time Markov
chain with state dependent transition probabilities, as illustrated in Figure 4.

Figure 4: Discrete-time Markov chain modeling target motion

At any time, a target can be in one of the two motion states: moving or stopped. The target state
transitions occur at times $t_k = k\delta$, $k = 1, 2, \ldots$, where $\delta$ is the time increment. The transition
probabilities $P_{ij}$ are state dependent; in particular,
$P_{ij} = \mathrm{Prob}\{\text{next state is } j \mid \text{current state is } i\}$ for $i, j \in \{m, s\}$, where $m$ and $s$ denote
moving and stopped, respectively.


By specifying the average number of the targets stopped in the steady state in a scenario, we
derive the transition probabilities for the Markov chain model that are appropriate for the
scenario. In particular, the average number of stopped targets in the steady state is given by

$$\text{average number of targets stopped} = \frac{n P_{ms}}{P_{ms} + P_{sm}}, \tag{68}$$

where

• $n$ is the number of targets in the scenario,

• $P_{ms}$ is the probability that a target will stop given that it is moving,

• $P_{sm}$ is the probability that a target will start moving given that it is stopped.

We select the desired average number of targets stopped by setting

$$P_{sm} = \alpha P_{ms} \tag{69}$$

for an appropriate value of the scalar $\alpha$.
The transition probability $P_{ms}$ satisfies

$$P_{ms} = 1 - P_{mm}, \tag{70}$$
where $P_{mm}$ is the probability that the target will be moving given that it is currently moving. This
probability is computed using the expected time a target will be moving given that it is moving,
i.e.,

$$\mathrm{ExpectedTimeMoving} = E\{\text{time target will spend moving} \mid \text{target is currently moving}\} = \frac{P_{mm}}{1 - P_{mm}}\,\delta,$$

where $\delta$ is the time interval between two successive transitions of the Markov chain modeling the target
motion mode. Using this relation, we can see that the transition probability $P_{mm}$ is given by

$$P_{mm} = \frac{\mathrm{ExpectedTimeMoving}}{\mathrm{ExpectedTimeMoving} + \delta}. \tag{71}$$
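The chain of relations in Equations (68) through (71) can be sketched in Python as follows. The sketch assumes that Equation (69) ties the two probabilities through a scalar ratio, $P_{sm} = \alpha P_{ms}$, which is an inference from Table 4 rather than a statement recovered from the text. The function name and the time increment of 1 second are hypothetical; the report does not state the value of $\delta$ in this section.

```python
def transition_probabilities(expected_time_moving, delta, n_targets, avg_stopped):
    """Markov chain transition probabilities for the move/stop model.

    expected_time_moving and delta must use the same time unit.
    Returns (P_mm, P_ms, P_sm).
    """
    # Equation (71): probability of staying in the moving state.
    p_mm = expected_time_moving / (expected_time_moving + delta)
    # Equation (70): probability of stopping.
    p_ms = 1.0 - p_mm
    # Equation (68) solved for the ratio alpha = P_sm / P_ms (assumed form of Eq. 69).
    alpha = n_targets / avg_stopped - 1.0
    p_sm = alpha * p_ms
    return p_mm, p_ms, p_sm


# Expected time moving of 60 s (Table 1), hypothetical delta of 1 s,
# 50 targets with 10 stopped on average.
p_mm, p_ms, p_sm = transition_probabilities(60.0, 1.0, 50, 10)
```

Substituting the returned values back into Equation (68) recovers the requested average of 10 stopped targets, which is a quick consistency check on the three relations.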

To  summarize,  the  tracker  parameters  characterizing  the  target  kinematic  and  motion  models  are 
as  follows: 


1. Root mean square target velocity. This velocity is used to compute the process noise
covariance for the filter corresponding to the “move” mode over the time increment $\Delta\mathrm{Time}$,
as follows:

$$\mathrm{ProcessNoiseVariance} = (\mathrm{RootMeanSquareVelocity} \cdot \Delta\mathrm{Time})^2. \tag{72}$$

The process noise variance for the filter corresponding to the “stop” mode is zero (which
follows from the preceding formula with zero root mean square velocity).

2. Maximum variance. This is the maximum error variance allowed before a target track is
modeled as being dropped.

3. Expected time moving. This is the expected time a target will be moving given that the target
is currently moving. It is used for estimating the probability that the target remains in the
“move” state, as given in Equation (71).


Sensor  Model  Parameters 

We  have  modeled  two  sensor  modes,  MTI  and  FTI.  The  MTI  sensor  mode  can  detect  moving 
targets  only,  while  the  FTI  sensor  mode  can  detect  stationary  targets  only.  Both  sensor  modes 
are  characterized  by  the  following  parameters: 

1. Detection probability. For each sensor mode, the detection probabilities are currently fixed to
a constant for all targets; however, the test setting allows one to model scenarios where
these probabilities are target dependent. The FTI detection probability depends on clutter and
the number of successive looks. We use values
corresponding to moderate clutter and coarse image processing (a single look).

2.  Measurement  error  variance.  For  each  sensor  mode,  the  sensor  measurement  errors  are 
assumed  to  be  Gaussian  random  variables  with  zero  mean  and  unit  standard  deviation. 

3.  Dwell-time.  This  is  the  time  a  sensor  takes  to  collect  data  in  a  particular  mode. 


Dynamic  Programming  Parameters 

The  parameters  used  in  the  dynamic  programming  formulation  of  the  move-stop  tracking 
problem  are  the  following: 
1. Target priority values.

2. Discount factors. There are two discount factors, one used in the reward-based formulation,
and the other used in the cost-based formulation.

3. Desired target error value. For each target $i$, a desired position error variance $G_i$ is specified,
which is used to define the goal region for the kinematic state of target $i$. The goal region is
$\{S_i \mid S_i \le G_i\}$.

4.4.2  Simulation  Results 

Our  simulations  are  generated  using  the  tracker  and  sensor  parameter  values  given  in  Table  1 
and  Table  2,  respectively.  The  parameter  values  used  in  the  dynamic  programming  formulation 
of  the  SRM  problem  are  listed  in  Table  3. 


Table 1: Tracker model parameter values used in our simulations

Parameter                           Value
Root mean square target velocity    10 meters per second
Maximum variance                    2500 square meters
Expected time moving                1 minute

Table 2: Sensor model parameter values used in our simulations

Parameter                                  MTI mode      FTI mode
Detection probability                      0.9           0.9
Standard deviation of measurement error    1 meter       1 meter
Dwell-time                                 0.1 second    10 seconds

Table 3: Dynamic programming parameter values used in our simulations

Parameter                                             Value
Target priority value                                 1 (for all targets)
Target goal state                                     5 meters (for all targets)
Precision maximizing SRM discount factor, $\alpha$    0.65
Error minimizing SRM discount factor, $\gamma$        1

We next present the simulation results obtained for the problem of tracking 50 targets with a
single sensor. We have four truth scenarios for the target motion that differ in the average
number of targets stopped in the steady state. In particular, the average number of targets
stopped takes the values 10, 20, 30, and 40. For each of these values, the transition
probabilities $P_{sm}$ and $P_{ss}$ for the Markov chain modeling target motion are computed according to
Equations (68) and (69) with $n=50$. Table 4 shows the relations between $P_{ms}$ and $P_{sm}$ for the four
truth scenarios.


Table 4: The relation for the transition probabilities $P_{sm}$ and $P_{ms}$ as the average number of
targets stopped (in the steady state) takes values 10, 20, 30, and 40

Average number of targets stopped    $P_{sm}$ relation with $P_{ms}$ (cf. Eq. 68)
10                                   $P_{sm} = 4\,P_{ms}$
20                                   $P_{sm} = \tfrac{3}{2}\,P_{ms}$
30                                   $P_{sm} = \tfrac{2}{3}\,P_{ms}$
40                                   $P_{sm} = \tfrac{1}{4}\,P_{ms}$

In  all  scenarios,  the  targets  have  the  same  priority  value  (see  Table  3),  and  the  tracking  time  is  10 
minutes. 
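The ratios between $P_{sm}$ and $P_{ms}$ for the four scenarios follow from Equation (68) alone, since that equation depends on the two probabilities only through their ratio. The short Python check below uses exact rational arithmetic; the function names are hypothetical and chosen for this illustration.

```python
from fractions import Fraction


def ratio_for(n_targets, avg_stopped):
    """Ratio P_sm / P_ms that yields avg_stopped in Equation (68)."""
    return Fraction(n_targets, avg_stopped) - 1


def average_stopped(n_targets, ratio):
    """Equation (68) written in terms of the ratio P_sm / P_ms."""
    return n_targets / (1 + ratio)


# Ratios for the four truth scenarios with 50 targets.
ratios = [ratio_for(50, k) for k in (10, 20, 30, 40)]
```

The resulting ratios are 4, 3/2, 2/3, and 1/4, so a larger share of stopped targets corresponds to a smaller probability of restarting relative to the probability of stopping.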

4.4.2.1 Target Truth and SRM Control Plots

In  this  section,  we  present  the  plots  of  the  control  decisions  for  the  precision  maximizing  SRM, 
the  error  minimizing  SRM,  and  the  entropy-based  SRM.  The  SRM  decisions  are  given  for  the 
target  motion  scenarios  where  the  average  number  of  stopped  targets  is  10,  20,  30,  and  40. 


Control Decisions for the Precision Maximizing SRM

The  following  Figure  5,  Figure  6,  Figure  7,  and  Figure  8  show  the  true  target  motion  and  the 
control  decisions  of  the  precision  maximizing  SRM  algorithm  for  the  scenarios  with  10,  20,  30, 
and  40  targets  stopped  on  average,  respectively.  The  control  decisions  correspond  to  a  typical 
sample  path  generated  by  the  algorithm. 

Figure 5: The plots show results for the precision maximizing SRM when 10 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark 
shade)  indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a 
target  is  stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is 
moving  and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding 
to  the  precision  maximizing  SRM  for  the  scenario  with  10  targets  stopped  on  average.  The 
bars  indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates 
the  sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the 
MTI  and  FTI  sensor  modes,  respectively.  For  example,  target  46  is  observed  at  sensor 
dwell  1,000  in  MTI  mode.  There  are  6,000  sensor  dwells  scheduled  during  10  minute 
tracking,  and  the  MTI  mode  is  used  in  each  dwell. 

Figure 6: The plots show results for the precision maximizing SRM when 20 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark 
shade)  indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a 
target  is  stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is 
moving  and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding 
to  the  precision  maximizing  SRM  for  the  scenario  with  20  targets  stopped  on  average.  The 
bars  indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates 
the  sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the 
MTI  and  FTI  sensor  modes,  respectively.  For  example,  target  42  is  observed  at  sensor 
dwell  1,000  in  MTI  mode.  There  are  6,000  sensor  dwells  scheduled  during  10  minute 
tracking,  and  the  MTI  mode  is  used  in  each  dwell. 

Figure 7: The plots show results for the precision maximizing SRM when 30 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark  shade) 
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is  moving 
and target 2 is stopped. The bottom plot shows the control decisions corresponding to
the  precision  maximizing  SRM  for  the  scenario  with  30  targets  stopped  on  average.  The 
bars  indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates 
the  sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the 
MTI  and  FTI  sensor  modes,  respectively.  For  example,  target  37  is  observed  at  sensor 
dwell  1,000  in  MTI  mode.  There  are  6,000  sensor  dwells  scheduled  during  10  minute 
tracking,  and  the  MTI  mode  is  used  in  each  dwell. 

Figure 8: The plots show results for the precision maximizing SRM when 40 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark 
shade)  indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a 
target  is  stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is 
moving  and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding 
to  the  precision  maximizing  SRM  for  the  scenario  with  40  targets  stopped  on  average.  The 
bars  indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates 
the  sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the 
MTI  and  FTI  sensor  modes,  respectively.  For  example,  target  41  is  observed  at  sensor 
dwell  1,000  in  MTI  mode.  There  are  6,000  sensor  dwells  scheduled  during  10  minute 
tracking,  and  the  MTI  mode  is  used  in  each  dwell. 

As indicated by the preceding plots, the precision maximizing SRM uses MTI mode exclusively.
This is to be expected, since long FTI dwells are not beneficial for this SRM. More specifically,
this SRM is maximizing the overall time the target errors are below the desired error goals.
During the long FTI dwell, the errors of moving targets increase well above the desired error.
Therefore, the time the target error goals are attained during a single FTI dwell is shorter than the
time the error goals are attained during a sequence of 100 MTI dwells. Being farsighted, the
precision maximizing SRM captures the trade-off between the benefits resulting, respectively,
from one long FTI dwell and a sequence of 100 short MTI dwells.
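The figures of 6,000 dwells and 100 MTI dwells per FTI dwell follow from the dwell times in Table 2, as the short Python calculation below shows. The last two lines are a rough illustration only: they assume that Equation (72) can be applied with the time increment set to one full FTI dwell, which the report does not state, and the constant names are hypothetical.

```python
# Durations in milliseconds to keep the arithmetic exact.
TRACKING_MS = 10 * 60 * 1000    # 10 minutes of tracking
MTI_DWELL_MS = 100              # 0.1 second (Table 2)
FTI_DWELL_MS = 10_000           # 10 seconds (Table 2)

mti_dwells_total = TRACKING_MS // MTI_DWELL_MS    # dwells if only MTI is used
mti_dwells_per_fti = FTI_DWELL_MS // MTI_DWELL_MS

# Assumed use of Equation (72) with the time increment equal to one FTI dwell.
RMS_VELOCITY = 10               # meters per second (Table 1)
MAX_VARIANCE = 2500             # square meters (Table 1)
growth_during_fti = (RMS_VELOCITY * FTI_DWELL_MS // 1000) ** 2
exceeds_max = growth_during_fti > MAX_VARIANCE
```

Under that assumption, the process noise variance accumulated by a moving target over a single FTI dwell would already exceed the maximum variance at which a track is dropped, which is consistent with the exclusive use of MTI mode seen in the plots.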

Note  that  the  algorithm  uses  the  MTI  mode  on  stopped  targets  to  check  whether  the  target  is  still 
stopped  or  it  has  started  moving. 


Control  Decisions  for  the  Error  Minimizing  SRM 

The following Figure 9, Figure 10, Figure 11, and Figure 12 show the true target motion and the
control  decisions  of  the  error  minimizing  SRM  algorithm  for  the  scenarios  with  10,  20,  30,  and 
40  targets  stopped  on  average,  respectively.  The  control  decisions  correspond  to  a  typical  sample 
path  generated  by  this  SRM  algorithm. 

Figure 9: The plots show results for the error minimizing SRM when 10 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark  shade) 
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is  moving 
and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding  to  the 
error  minimizing  SRM  for  the  scenario  with  10  targets  stopped  on  average.  The  bars 
indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the 
sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI 
and  FTI  sensor  modes,  respectively.  For  example,  target  14  is  observed  at  sensor  dwell 
1,000 in MTI mode. There are 6,000 sensor dwells scheduled during 10 minute tracking,
and the MTI mode is used in each dwell.

Figure 10: The plots show results for the error minimizing SRM when 20 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark  shade) 
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is  moving 
and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding  to  the 
error  minimizing  SRM  for  the  scenario  with  20  targets  stopped  on  average.  The  bars 
indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the 
sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI 
and  FTI  sensor  modes,  respectively.  For  example,  target  7  is  observed  at  sensor  dwell  1,000 
in MTI mode. There are 6,000 sensor dwells scheduled during 10 minute tracking, and the
MTI mode is used in each dwell.

Figure 11: The plots show results for the error minimizing SRM when 30 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark  shade) 
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is  moving 
and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding  to  the 
error  minimizing  SRM  for  the  scenario  with  30  targets  stopped  on  average.  The  bars 
indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the 
sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI 
and  FTI  sensor  modes,  respectively.  For  example,  target  41  is  observed  at  sensor  dwell 
1,000 in MTI mode. There are 6,000 sensor dwells scheduled during 10 minute tracking,
and the MTI mode is used in each dwell.

Figure 12: The plots show results for the error minimizing SRM when 40 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark  shade) 
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is  moving 
and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding  to  the 
error minimizing SRM for the scenario with 40 targets stopped on average. The bars
indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the 
sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI 
and  FTI  sensor  modes,  respectively.  For  example,  target  49  is  observed  at  sensor  dwell 
1,000 in MTI mode. There are 6,000 sensor dwells scheduled during 10 minute tracking,
and the MTI mode is used in each dwell.

Similar to the precision maximizing SRM, the error minimizing SRM uses MTI mode
exclusively, as indicated in the preceding plots. Again, this is to be expected, since this SRM is
minimizing the discounted sum of the target errors above the level of the desired error goal
accumulated over the tracking time. The total target error accumulated during a long FTI dwell is
larger than the target error accumulated during a sequence of 100 MTI dwells. Therefore, the
FTI mode is less beneficial for the cost-rollout algorithm than the MTI mode. Being farsighted,
the error minimizing SRM can anticipate the benefits resulting from a long FTI dwell and a
sequence of 100 short MTI dwells, and can select a control that is more beneficial in the long run.

Note that, similar to the precision maximizing algorithm, the error minimizing SRM uses the MTI
mode for stopped targets to check whether the target is still stopped or has started moving.


Control  Decisions  for  the  Entropy-Based  SRM 

The  following  Figure  13,  Figure  14,  Figure  15,  and  Figure  16  show  the  true  target  motion  and 
the  control  decisions  of  the  myopic  entropy-based  SRM  for  the  scenarios  with  10,  20,  30,  and  40 
targets  stopped  on  average,  respectively.  The  control  decisions  correspond  to  a  typical  sample 
path  generated  by  this  algorithm. 

Figure 13: The plots show results for the myopic entropy-based SRM when 10 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark 
shade)  indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a 
target  is  stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  1,000,  target  1  is 
moving  and  target  2  is  stopped.  The  bottom  plot  shows  the  control  decisions  corresponding 
to  the  entropy-based  SRM  for  the  scenario  with  10  targets  stopped  on  average.  The  bars 
indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the 
sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI 
and  FTI  sensor  modes,  respectively.  For  example,  target  37  is  observed  at  sensor  dwell 
1,000 in MTI mode. During the 10-minute tracking period, 6,000 sensor dwells are scheduled and the MTI mode is used in each dwell.

Figure 14: The plots show results for the myopic entropy-based SRM when 20 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark  shade) 
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  500,  target  1  is  stopped 
and  target  4  is  moving.  The  bottom  plot  shows  the  control  decisions  corresponding  to  the 
entropy-based SRM for the scenario with 20 targets stopped on average. The bars indicate
which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the  sensor  mode 
used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI  and  FTI 
sensor  modes,  respectively.  For  example,  target  45  is  observed  at  sensor  dwell  500  in  MTI 
mode  and  target  49  is  observed  at  sensor  dwell  2,052  in  FTI  mode.  During  10  minute 
tracking,  3,546  sensor  dwells  are  scheduled  and  the  long  FTI  mode  is  used  in  25  dwells. 

Figure 15: The plots show results for the myopic entropy-based SRM when 30 targets are
stopped on average. The top plot shows the true target motion. The blue color (dark shade)
indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a  target  is 
stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  200,  target  1  is  stopped 
and  target  3  is  moving.  The  bottom  plot  shows  the  control  decisions  corresponding  to  the 
entropy-based  SRM  for  the  scenario  with  30  targets  stopped  on  average.  The  bars  indicate 
which  target  is  observed  at  which  sensor  dwell,  and  the  bar  color  indicates  the  sensor  mode 
used  for  the  observation.  The  blue  and  the  red  colors  correspond  to  the  MTI  and  FTI 
sensor  modes,  respectively.  For  example,  target  29  is  observed  at  sensor  dwell  200  in  MTI 
mode  and  target  40  is  observed  at  sensor  dwell  110  in  FTI  mode.  During  10  minute 
tracking,  1,633  sensor  dwells  are  scheduled  and  the  long  FTI  mode  is  used  in  45  dwells. 

Figure 16: The plots show results for the myopic entropy-based SRM when 40 targets are
stopped  on  average.  The  top  plot  shows  the  true  target  motion.  The  blue  color  (dark 
shade)  indicates  that  a  target  is  moving  and  the  red  color  (light  shade)  indicates  that  a 
target  is  stopped.  For  example,  at  the  time  corresponding  to  sensor  dwell  50,  target  1  is 
stopped  and  target  30  is  moving.  The  bottom  plot  shows  the  control  decisions 
corresponding  to  the  entropy-based  SRM  for  the  scenario  with  40  targets  stopped  on 
average.  The  bars  indicate  which  target  is  observed  at  which  sensor  dwell,  and  the  bar 
color  indicates  the  sensor  mode  used  for  the  observation.  The  blue  and  the  red  colors 
correspond  to  the  MTI  and  FTI  sensor  modes,  respectively.  For  example,  target  44  is 
observed  at  sensor  dwell  50  in  MTI  mode  and  target  48  is  observed  at  sensor  dwell  42  in 
FTI mode. During the 10-minute tracking period, 504 sensor dwells are scheduled and the long FTI mode is used in 56 dwells.

As seen from the preceding plots (Figure 13 through Figure 16), the entropy-based SRM algorithm
schedules the longer FTI mode more frequently to observe the stopped targets as the number of
stopped targets increases. In particular:

1. When 10 targets are stopped on average, the FTI mode is not used in any dwell.

2. When 20 targets are stopped on average, the FTI mode is used in 25 dwells out of the
total of 3,546 dwells scheduled during the 10-minute tracking period.

3. When 30 targets are stopped on average, the FTI mode is used in 45 dwells out of the
total of 1,633 dwells scheduled during the 10-minute tracking period.

4. When 40 targets are stopped on average, the FTI mode is used in 56 dwells out of the
total of 504 dwells scheduled during the 10-minute tracking period.
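
The dwell counts listed above can be converted into usage fractions to make the trend easier to read. The sketch below does this arithmetic. The dwell durations are an assumption inferred from the text (6,000 MTI dwells fill the 10 minutes, and one FTI dwell is compared with a sequence of 100 MTI dwells); they are not stated as parameters in this section, and the function name `fti_usage` is hypothetical.

```python
# Dwell counts reported for Figures 13-16:
# (average number of stopped targets, FTI dwells, total dwells).
reported = [(10, 0, 6000), (20, 25, 3546), (30, 45, 1633), (40, 56, 504)]

# ASSUMPTION (inferred, not given in this section): an MTI dwell lasts 0.1 s
# and an FTI dwell lasts as long as 100 MTI dwells.
MTI_SECONDS = 0.1
FTI_SECONDS = 100 * MTI_SECONDS


def fti_usage(fti_dwells, total_dwells):
    """Return (fraction of dwells in FTI mode, fraction of sensor time in FTI mode)."""
    mti_dwells = total_dwells - fti_dwells
    fti_time = fti_dwells * FTI_SECONDS
    total_time = fti_time + mti_dwells * MTI_SECONDS
    return fti_dwells / total_dwells, fti_time / total_time


for stopped, fti, total in reported:
    dwell_frac, time_frac = fti_usage(fti, total)
    print(f"{stopped} stopped: {dwell_frac:.1%} of dwells, {time_frac:.1%} of time in FTI")
```

Under these assumed durations, a small fraction of dwells in FTI mode can still account for most of the sensor time, which is consistent with the sharp drop in the total number of dwells scheduled.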

The  entropy-based  SRM  algorithm,  being  myopic,  does  not  anticipate  long  term  benefits 
resulting  from  current  control  decisions.  In  particular,  this  algorithm  selects  control  decisions 
based  on  the  benefits  of  a  single  dwell  plan.  For  such  a  short  planning  horizon,  the  FTI  mode 
may  seem  more  beneficial  than  the  MTI  mode,  which  results  in  scheduling  dwells  with  the  FTI 
mode. 

4.4.3  Time  Averaged  Mean-Square  Error  and  Fraction  of  Time  the  Error  Goals  are  Met 

Here, we present our simulation results for the farsighted precision maximizing and
error minimizing SRM algorithms and the myopic entropy-based SRM algorithm, obtained for
the four target-motion scenarios. We ran four Monte Carlo simulations for each target-motion
scenario and each SRM algorithm. Note that the number of Monte Carlo simulations needed to
achieve reasonably small 95 percent confidence intervals is small because there is not much
variability in the simulation results. In particular, all the variability in the simulation results (for
a fixed scenario and a selected SRM algorithm) comes from the uncertainty in the sensor
detections, and the sensor detection probability is 0.9 for both modes (see Table 2).

We  present  our  simulation  results  in  terms  of  two  measures  of  performance: 

1.  Time  averaged  mean-square  error. 

2.  Average  fraction  of  time  the  target  error  goals  are  met. 

The  time  averaged  mean-square  error  is  computed  by  averaging  the  sum  of  target  errors  in  time 
over  the  number  of  targets,  over  the  tracking  time,  and  over  the  number  of  runs.  Similarly,  the 
average  fraction  of  time  the  target  error  goals  are  met  is  computed  by  averaging  the  sum  of  the 
fractions  of  time  the  target  error  goal  is  met  over  the  number  of  targets  and  over  the  number  of 
runs.  The  fraction  of  time  the  target  error  goal  is  met  is  the  total  time  (during  the  tracking)  the 
target  error  is  below  the  desired  error  goal  divided  by  the  total  tracking  time.  The  simulation 
results  for  the  time  averaged  mean-square  error  are  presented  in  Figure  17,  and  the  results  for  the 
average fraction of time the error goals are met are shown in Figure 18. The bars in the figures
mark the 95 percent confidence intervals for the variability in the data.
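
The two measures of performance can be sketched in a few lines. This is a minimal illustration, assuming the target errors are available on a uniform time grid (the report's dwells have variable duration, so a faithful implementation would weight samples by time). The data layout `errors[run][target][k]`, the function names, and the numbers in the example are hypothetical.

```python
def time_averaged_mse(errors):
    """Average the target errors over time samples, targets, and runs.

    errors[run][target][k] is the squared error of a target at time sample k.
    ASSUMPTION: samples are uniformly spaced in time.
    """
    total = 0.0
    count = 0
    for run in errors:
        for track in run:
            total += sum(track)
            count += len(track)
    return total / count


def fraction_goal_met(errors, goal):
    """Average, over targets and runs, of the fraction of time the error is below goal."""
    fractions = []
    for run in errors:
        for track in run:
            below = sum(1 for e in track if e < goal)
            fractions.append(below / len(track))
    return sum(fractions) / len(fractions)


# Hypothetical single-run data for two targets and an error goal of 10.
policy_a = [[[1, 1, 1, 1], [100, 100, 100, 100]]]   # one tight track, one poor track
policy_b = [[[12, 12, 12, 12], [12, 12, 12, 12]]]   # uniformly moderate tracks

for name, data in (("A", policy_a), ("B", policy_b)):
    print(name, time_averaged_mse(data), fraction_goal_met(data, goal=10))
```

The toy data show why the two measures can rank algorithms differently: policy A has the larger mean-square error but meets the goal half of the time, while policy B has the smaller mean-square error and never meets the goal.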

Figure 17: The results for the time averaged mean-square error obtained for the four
target-motion  scenarios  and  for  the  three  SRM  algorithms. 

The  four  target-motion  scenarios  correspond  to  the  average  number  of  targets  stopped  taking 
values  10,  20,  30,  and  40,  respectively.  The  three  SRM  algorithms  are  the  farsighted  precision 
maximizing  and  error  minimizing  SRM  algorithms,  and  the  myopic  entropy-based  SRM 
algorithm. 

The  simulation  results  in  Figure  17  indicate  that  the  farsighted  error  minimizing  SRM  algorithm 
maintains  better  quality  target  tracks  than  the  farsighted  precision  maximizing  and  the  myopic 
algorithms.  In  particular,  the  difference  in  the  time  averaged  mean-square  error  for  the  precision 
maximizing  SRM  and  for  the  error  minimizing  SRM  is  largest  when  most  of  the  targets  are 
moving  (see  the  results  in  Figure  17  for  10  and  20  targets  stopped  on  average).  When  most  of  the 
targets  are  moving,  the  entropy-based  algorithm  also  has  smaller  error  than  the  precision 
maximizing  algorithm.  However,  as  more  targets  are  stopped  (see  the  results  for  30  targets 
stopped  on  average),  the  entropy-based  algorithm  accumulates  more  error  than  the  precision 
maximizing  algorithm.  When  most  of  the  targets  are  stopped,  the  tracking  is  easier  because  there 
are fewer constraints on the sensor resources. In this case, all three algorithms perform similarly
(see  the  results  for  40  targets  stopped  on  average). 

Figure 18: The results for the average fraction of time the target error goals are met
obtained  for  the  four  target-motion  scenarios  and  for  the  three  SRM  algorithms. 

The  four  target-motion  scenarios  correspond  to  the  average  number  of  targets  stopped  taking 
values  10,  20,  30,  and  40,  respectively.  The  three  SRM  algorithms  are  the  farsighted  precision 
maximizing  and  error  minimizing  SRM  algorithms,  and  the  myopic  entropy-based  SRM 
algorithm. 

The simulation results in Figure 18 indicate that the farsighted precision maximizing SRM
algorithm has the best performance and the myopic entropy-based algorithm has the worst
performance for the fraction of time the target error goals are met. The difference in the fraction
of time the error goals are met between the precision maximizing SRM and the error minimizing
SRM is largest when most of the targets are moving (see the results in Figure 18 for 20 targets
stopped on average). The difference in the fraction of time the error goals are met between the
precision maximizing and the entropy-based algorithms is largest when most of the targets are
stopped (see the results in Figure 18 for 30 targets stopped on average).

Considering  the  results  for  the  precision  maximizing  algorithm  in  both  Figure  17  and  Figure  18, 
and  the  fact  that  this  algorithm  has  scheduled  the  MTI  mode  for  each  dwell,  it  seems  that  the 
precision  maximizing  algorithm  keeps  the  target  error  below  the  desired  error  goal  for  as  many 
targets  as  possible,  resulting  in  a  very  small  error  for  some  targets  and  very  large  error  on  the 
other  targets.  Consequently,  the  average  target  error  is  large,  but  the  overall  time  the  error  goals 
are  met  is  also  large. 

Considering  the  results  for  the  error  minimizing  algorithm  in  both  Figure  17  and  Figure  18,  we 
see  that  this  algorithm  maintains  overall  small  errors  on  all  targets,  but  does  not  necessarily 
maintain  the  errors  smaller  than  their  corresponding  error  goals.  Consequently,  this  algorithm 
has  small  average  target  error,  but  the  overall  time  the  error  goals  are  met  is  not  always  the  best. 

The  performance  of  the  myopic  entropy-based  algorithm  is  not  satisfactory  in  either  of  the  two 
measures.  This  algorithm  schedules  the  longer  FTI  mode  more  frequently  to  observe  the  stopped 
targets  as  the  number  of  stopped  targets  increases  (see  Figure  16).  This  results  in  substantially 
less  time  spent  observing  moving  targets,  and  the  corresponding  track  errors  far  exceed  the 
goals.  Consequently,  the  overall  target  error  is  large  and  the  time  target  error  goals  are  met  is 
short. 

4.5  CONCLUSIONS 

We  have  developed  novel,  computable,  farsighted  SRM  algorithms  for  move/stop  tracking  with 
a  multi-mode  sensor.  This  particular  sensor  management  problem  is  challenging  because  of  the 
complex  target  dynamics  and  the  variable  duration  of  sensor  actions.  We  have  evaluated  the 
farsighted  algorithms  against  a  myopic,  entropy-based  SRM  algorithm.  Our  simulation  results 
indicate  that  the  farsighted  algorithms  have  promising  behavior.  For  example,  the  farsighted 
precision  maximizing  SRM  results  in  a  policy  that  adaptively  adjusts  the  frequency  with  which 
moving  and  stopped  targets  are  observed  in  a  manner  that  results  in  better  tracks  than  the  myopic 
entropy-based  sensor  manager.  We  attribute  this  to  the  capability  of  the  farsighted  algorithm  to 
adapt  the  target  revisit  rates  appropriately. 

We believe that our simulation results are important indicators that the farsighted algorithms are
better than myopic ones, especially for SRM problems with complex dynamics (e.g., when
targets are randomly starting and stopping and/or sensor actions have significantly different
durations).

5 COMPUTABLE OPTIMAL STRATEGIES


In  this  section,  we  investigate  computable  optimal  strategies  for  sensor  resource  management 
(SRM)  problems.  SRM  problems  can  be  formulated  as  Markov  decision  processes  (MDPs) 
which  in  turn  can  be  solved  optimally,  at  least  in  principle,  by  numerical  dynamic  programming 
algorithms.  Since  the  processing  time  and  memory  required  to  solve  the  dynamic  program 
associated  with  the  MDP  in  SRM  is  exponential  in  the  number  n  of  targets  being  sensed, 
optimal  numerical  solutions  of  the  general  SRM  problem  are  intractable  for  large  n  of  interest 
(e.g.,  for  hundreds  of  targets).  However,  there  are  classes  of  problems  such  as  multi-armed 
bandit  problems  which  have  optimal  strategies  in  terms  of  maximizing  a  priority  index  rule 
computed  independently  for  each  target.  These  strategies  are  computationally  tractable,  and  can 
be  used  as  subroutines  in  computing  approximate  optimal  strategies  of  more  realistic  problems. 
Sometimes  priority  index  solutions  can  be  obtained  for  problems  which  aren't  multi-armed 
bandit  problems.  For  example,  it  is  shown  in  [7]  that  a  priority  index  solution  based  on  the 
conditional  probabilities  of  each  target  being  a  threat  is  optimal  for  a  finite  horizon  classification 
problem. 

This  report  describes  a  sufficient  condition  to  use  for  checking  whether  a  given  strategy,  such  as 
one  given  by  a  priority  index  rule,  is  optimal.  The  sufficient  condition  applies  to  finite  horizon 
MDPs  with  terminal  reward,  and  can  be  used  to  show  the  optimality  of  the  search  strategy  in  [7] 
and  in  some  other  examples  that  we  will  describe.  The  report  is  organized  as  follows: 

•  Section  5.1  formulates  the  sufficient  condition  for  a  general  MDP  in  terms  of  strategy 
sets.  This  section  defines  strategy  set,  terminal  optimality  of  a  strategy  set,  deferrable 
decisions,  and  commutative  transition  probabilities.  If  a  strategy  set  is  terminally  optimal, 
has  deferrable  decisions,  and  the  MDP  has  commutative  transition  probabilities,  then  the 
strategy  set  is  optimal.  The  section  specializes  the  general  result  to  symmetric  MDP 
problems,  which  are  given  in  some  of  the  examples  later  in  the  section. 

•  Section  5.2  describes  a  subclass  of  MDP  that  corresponds  to  many  SM  problems,  namely 
the  class  of  MDP  corresponding  to  one  sensor  observing  n  non-interacting  targets  one  at 
a  time.  This  section  specializes  the  definitions  and  results  of  Section  5.1  to  this  subclass 
of  SM  MDP  problems. 

•  Section  5.3  applies  the  sufficient  condition  to  show  the  optimality  of  a  strategy  for  a  class 
of  MDP  (which  is  a  subclass  of  the  SM  MDP  problems  of  Section  5.2).  This  class  of 
MDP  is  characterized  by  n  non-interacting  Markov  chains  which  have  an  ordered  state 
space.  In  particular,  the  chains  can  only  transition  at  most  one  step  in  one  direction  and 
include  birth-death  processes  as  a  special  case. 

•  Section  5.4  applies  the  sufficient  condition  to  show  the  optimal  strategy  for  a  class  of 
symmetric  binary  classification  problems. 

•  Section  5.5  applies  the  sufficient  condition  to  show  the  optimal  strategy  for  a  class  of 
search  problems.  This  strategy  was  developed  previously  in  [7]. 

5.1 SUFFICIENT CONDITION


We will denote an MDP with terminal reward by the tuple $(\mathcal{X}, \mathcal{U}, p_u, R, T)$, where $\mathcal{X}$ denotes the
state space of the Markov chain, $\mathcal{U}$ denotes the set of possible decisions, $\{p_u : u \in \mathcal{U}\}$ is the
collection of transition probabilities parameterized by the decision $u$, $R$ is the terminal reward
function $R : \mathcal{X} \to \mathbb{R}$, and the integer $T$ is the terminal time. We will assume that $\mathcal{X}$ is discrete
and $\mathcal{U}$ is finite.

If $X(t)$, $0 \le t \le T$, is the Markov process with decisions $U(t)$, $0 \le t \le T-1$, and terminal reward
$R(X(T))$, the MDP problem is to select $U$ to maximize the expected value $E\{R(X(T))\}$ of the
terminal reward. We assume that the decision $U(t)$ depends only on $X(0), \ldots, X(t)$ and that

$$\Pr\{X(t+1) = \xi \mid X(t) = x,\ U(t) = u\} = p_u(\xi \mid x). \tag{72}$$

The dynamic programming equations for the optimal reward function for the MDP
$(\mathcal{X}, \mathcal{U}, p_u, R, T)$ are given as follows. The terminal condition is

$$V(x, T) = R(x). \tag{73}$$

The recursion is

$$V(x,t) = \max_u \{V_u(x,t)\} \tag{74}$$

for times $0 \le t \le T-1$, where we define

$$V_u(x,t) := \sum_{\xi} V(\xi, t+1)\, p_u(\xi \mid x). \tag{75}$$
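
The recursion (73)-(75) is ordinary backward induction, and a small sketch may help fix the notation. The code below is a minimal illustration for a finite state space, not the report's implementation. The data layout (`p[u][x]` as a dictionary of successor probabilities), the function name, and the two-state example are all assumptions made for the illustration.

```python
def solve_terminal_reward_mdp(states, decisions, p, R, T):
    """Backward induction for a finite MDP with terminal reward only.

    p[u][x] is a dict {xi: probability} holding p_u(xi | x).
    Returns (V, Vu) with V[t][x] the optimal value and Vu[t][u][x] the
    value of taking decision u at time t and acting optimally afterwards.
    """
    V = [None] * (T + 1)
    Vu = [None] * T
    V[T] = {x: R(x) for x in states}                      # terminal condition (73)
    for t in range(T - 1, -1, -1):
        Vu[t] = {
            u: {x: sum(prob * V[t + 1][xi] for xi, prob in p[u][x].items())
                for x in states}                          # definition (75)
            for u in decisions
        }
        V[t] = {x: max(Vu[t][u][x] for u in decisions) for x in states}  # recursion (74)
    return V, Vu


# Hypothetical example: state 0 is absorbing and rewarded; decision "work"
# moves state 1 to state 0 with probability 0.5, decision "idle" does nothing.
states = [0, 1]
decisions = ["work", "idle"]
p = {
    "work": {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
    "idle": {0: {0: 1.0}, 1: {1: 1.0}},
}
V, Vu = solve_terminal_reward_mdp(states, decisions, p, lambda x: 1.0 if x == 0 else 0.0, T=2)
print(V[0][1], Vu[0]["idle"][1])
```

The table `Vu` is kept alongside `V` because the definitions that follow (strategy sets, deferrable decisions) are stated in terms of the decision-conditioned values $V_u(x,t)$.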

Definition 1. Suppose that the MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ has the probability transition functions
$p_u(\xi \mid x)$ for $x, \xi \in \mathcal{X}$, $u \in \mathcal{U}$, and terminal reward $R(x)$ for $x \in \mathcal{X}$. If $\Phi(x,t) \subset \mathcal{U}$ for each
$x \in \mathcal{X}$ and $0 \le t \le T-1$, we say that $\Phi$ is a strategy set for the MDP.

Definition 2. If $\Phi$ is a strategy set for the MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$, and if for each $x \in \mathcal{X}$ and each
$u \in \Phi(x, T-1)$,

$$\sum_{y} R(y)\, p_u(y \mid x) = \max_{v} \sum_{y} R(y)\, p_v(y \mid x), \tag{76}$$

we say that the strategy set $\Phi$ is terminally optimal for the MDP.


Definition 3. If $\Phi$ is a strategy set for the MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$, and if for each $t$ such that
$0 \le t < T-1$, each $x \in \mathcal{X}$, and for all $u, v, y$,

$$u \in \Phi(x,t),\quad V_v(x,t) > V_u(x,t),\quad \text{and}\quad p_v(y \mid x) > 0 \quad\text{imply that}\quad u \in \Phi(y, t+1), \tag{77}$$

then we say that decisions are deferrable in the strategy set $\Phi$.

Remark 1. Definition 3 gives conditions under which, if $u$ is in the decision set at the current
time but a different decision $v$ is made, then $u$ is still in the decision set at the next time. This
condition allows using an interchange argument to prove the optimality of the decision set
(Theorem 1). Unfortunately, Definition 3 is hard to check in practice because it refers to the
optimal reward function. However, it is implied by various stronger conditions that are easier to
check. For example, if for each $t$ such that $0 \le t < T-1$, and each $x \in \mathcal{X}$,

$$u \in \Phi(x,t),\quad v \ne u,\quad \text{and}\quad p_v(y \mid x) > 0 \quad\text{imply that}\quad u \in \Phi(y, t+1), \tag{78}$$

then decisions are deferrable in the strategy set $\Phi$. This condition is stronger than the definition,
since $V_v(x,t) > V_u(x,t)$ obviously implies that $v \ne u$. At the end of this section we prove another
stronger condition for problems with symmetry.

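Definition 4. We say that the probability transition functions $p_u(\xi \mid x)$ are commutative if for
all $u, v \in \mathcal{U}$,

$$\sum_{\eta} p_u(\xi \mid \eta)\, p_v(\eta \mid x) = \sum_{\eta} p_v(\xi \mid \eta)\, p_u(\eta \mid x) \tag{79}$$

for all $x, \xi \in \mathcal{X}$.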
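
Commutativity as in (79) says that the two-step transition law does not depend on the order in which two decisions are applied. For a small finite MDP it can be checked directly, as in the following sketch. The representation (`p[u][x]` as a dictionary of successor probabilities), the function names, and both toy examples are assumptions made for illustration.

```python
def compose(p_first, p_second, states):
    """Two-step law: out[x][xi] = sum_eta p_second(xi | eta) * p_first(eta | x)."""
    out = {}
    for x in states:
        row = {xi: 0.0 for xi in states}
        for eta, a in p_first[x].items():
            for xi, b in p_second[eta].items():
                row[xi] += a * b
        out[x] = row
    return out


def is_commutative(p, states, tol=1e-12):
    """Check condition (79) for every pair of decisions in p."""
    decisions = list(p)
    for u in decisions:
        for v in decisions:
            v_then_u = compose(p[v], p[u], states)   # left-hand side of (79)
            u_then_v = compose(p[u], p[v], states)   # right-hand side of (79)
            for x in states:
                for xi in states:
                    if abs(v_then_u[x][xi] - u_then_v[x][xi]) > tol:
                        return False
    return True


states = [0, 1]
work = {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}}   # state 1 drops to 0 with probability 0.5
idle = {0: {0: 1.0}, 1: {1: 1.0}}           # identity transition
swap = {0: {1: 1.0}, 1: {0: 1.0}}           # deterministic exchange of the two states

print(is_commutative({"work": work, "idle": idle}, states))
print(is_commutative({"work": work, "swap": swap}, states))
```

The first pair commutes because one of the transitions is the identity; the second pair does not, since applying `work` and then `swap` from state 0 always ends in state 1, while the reverse order ends in state 0 half of the time.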

Theorem 1. Suppose that $\Phi$ is a strategy set for an MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ with commutative
transition probability functions $p_u$, such that $\Phi$ is terminally optimal and decisions in $\Phi$ are
deferrable. Then the strategy set $\Phi$ is optimal in the sense that any decision $U(t) \in \Phi(X(t), t)$ for
$0 \le t \le T-1$ is an optimal decision for $(\mathcal{X}, \mathcal{U}, p_u, R, T)$.

Proof. Define $\Phi^*(x,t)$ to be the set

$$\Phi^*(x,t) = \Big\{ u : V_u(x,t) = \max_{w} V_w(x,t) \Big\}. \tag{80}$$

Thus, $\Phi^*(x,t)$ is the set of optimal strategies. We want to prove that

$$\Phi(x,t) \subset \Phi^*(x,t). \tag{81}$$

The terminal optimality condition is equivalent to

$$\Phi(x, T-1) \subset \Phi^*(x, T-1). \tag{82}$$

Thus, assume that $\Phi(x, t+1) \subset \Phi^*(x, t+1)$ is true for $\Phi$ and prove (81) from it. Suppose that
$u \in \Phi(x,t)$ and $u \notin \Phi^*(x,t)$. Clearly $\Phi^*(x,t) \ne \emptyset$ and there is $v \in \Phi^*(x,t)$ such that
$V_v(x,t) > V_u(x,t)$. The condition that decisions in $\Phi$ are deferrable implies that
$u \in \Phi(X(t+1), t+1)$, where $X(t+1)$ results from using $U(t) = v$. The induction hypothesis
implies that

$$\Phi(X(t+1), t+1) \subset \Phi^*(X(t+1), t+1), \tag{83}$$

so that $u \in \Phi^*(X(t+1), t+1)$ and $U(t) = v$, $U(t+1) = u$ are optimal decisions. We now can use
the commutativity of the transitions $p_u$ to show that the decisions $U(t) = u$, $U(t+1) = v$ have the
same expected value and must be optimal too. Thus, $u$ is optimal, contrary to assumption, and
we must have $u \in \Phi^*(x,t)$.

To complete the proof, we note that starting from $X(t)$, if $X(t+2)$ is the state resulting from
$U(t) = v$, $U(t+1) = u$ and $\tilde{X}(t+2)$ is the state resulting from $U(t) = u$, $U(t+1) = v$, then
commutativity implies that $X(t+2)$ and $\tilde{X}(t+2)$ have the same distribution. By assumption
(induction) the decisions $U(t) = v$, $U(t+1) = u$ are optimal and the value is

$$V(X(t), t) = E\{V(X(t+2), t+2) \mid X(t)\}. \tag{84}$$

Commutativity implies that

$$E\{V(X(t+2), t+2) \mid X(t)\} = E\{V(\tilde{X}(t+2), t+2) \mid X(t)\}, \tag{85}$$

which implies that $U(t) = u$, $U(t+1) = v$ must also be optimal decisions. Q. E. D.

Remark 2. If $\Phi^*(x,t)$ is the optimal strategy set for $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ as defined in (80), then $\Phi^*$
is necessarily terminally optimal. It also necessarily satisfies the condition for deferrable decisions,
simply because the hypothesis of the condition,

$$u \in \Phi^*(x,t),\quad V_v(x,t) > V_u(x,t), \tag{86}$$

is always false. As we indicated in Remark 1, this condition is difficult to check in practice, but
we can replace it with stronger conditions which don't refer to the optimal reward function. With
these stronger conditions, it's important to have the third condition, commutativity of the
transition probabilities, to prove the optimality of a proposed strategy set.

To conclude this section we will prove another stronger condition for deferrable decisions in $\Phi$
based on symmetric MDP problems.


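Definition 5. The MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ is symmetric if for some $n$

$$\begin{aligned}
\mathcal{X} &= X^n, \\
\mathcal{U} &= \{1, \ldots, n\}, \\
p_{\pi(i)}(\pi y \mid \pi x) &= p_i(y \mid x),
\end{aligned} \tag{87}$$

and

$$R(x) = R(\pi x), \tag{88}$$

where $\pi$ permutes the components of $x, y$, namely

$$\pi x = (x_{\pi(1)}, \ldots, x_{\pi(n)}), \tag{89}$$

for any permutation $\pi$ of $1, \ldots, n$ and all $x \in X^n$.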

Lemma 1. Let $V(x,t)$ be the value function for an MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ which is symmetric. Then
$V(x,t)$ is symmetric in $x_i$ for each $0 \le t \le T$. That is, if $\pi$ is a permutation of $1, \ldots, n$ and
$\pi x = (x_{\pi(1)}, \ldots, x_{\pi(n)})$, then

$$V(x,t) = V(\pi x, t). \tag{90}$$

In addition,

$$V_i(x,t) = V_{\pi(i)}(\pi x, t). \tag{91}$$

Proof. Let $x$ denote a vector in $X^n = \mathcal{X}$. The dynamic programming equations for the MDP are
given as follows. The terminal condition is

$$V(x, T) = R(x). \tag{92}$$

The recursion is

$$V(x,t) = \max_i \{V_i(x,t)\} \tag{93}$$

for times $0 \le t \le T-1$, where

$$V_i(x,t) := \sum_{y} V(y, t+1)\, p_i(y \mid x). \tag{94}$$

Clearly,

$$V(\pi x, T) = R(\pi x) = R(x) = V(x, T). \tag{95}$$

Assume that

$$V(x, t+1) = V(\pi x, t+1) \tag{96}$$

for all $x, \pi$ and prove it for $t$. By definition,

$$V_i(x,t) = \sum_{y} V(y, t+1)\, p_i(y \mid x), \tag{97}$$

and by the symmetry assumptions,

$$\sum_{y} V(y, t+1)\, p_i(y \mid x) = \sum_{y} V(\pi y, t+1)\, p_{\pi(i)}(\pi y \mid \pi x) = \sum_{y} V(y, t+1)\, p_{\pi(i)}(y \mid \pi x). \tag{98}$$

Thus, it follows that

$$V_i(x,t) = V_{\pi(i)}(\pi x, t). \tag{99}$$

For any permutation $\phi$ and any $y$ it is always true that

$$V(y,t) = \max_i \{V_{\phi(i)}(y,t)\}. \tag{100}$$

In particular, it is true for $\phi = \pi$ and $y = \pi x$. Thus,

$$V(\pi x, t) = \max_i \{V_{\pi(i)}(\pi x, t)\} = \max_i \{V_i(x,t)\} = V(x,t). \tag{101}$$

This completes the induction and the proof. Q. E. D.

Proposition 1. Suppose that the MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ is symmetric. Then if

$$u \in \Phi(x,t),\quad x_v \ne x_u,\quad \text{and}\quad p_v(y \mid x) > 0 \quad\text{imply that}\quad u \in \Phi(y, t+1), \tag{102}$$

then decisions are deferrable in $\Phi$.

Proof. Suppose that $u \in \Phi(x,t)$, $V_v(x,t) > V_u(x,t)$ and $p_v(y \mid x) > 0$. Because of the symmetry
assumption,

$$V_v(x,t) = V_{\pi(v)}(\pi x, t) \tag{103}$$

for all permutations $\pi$. Let $\pi$ be the permutation that interchanges $v$ and $u$. Then if $x_v = x_u$,
$V_v(x,t) = V_u(x,t)$. Thus, $V_v(x,t) > V_u(x,t)$ implies that $x_v \ne x_u$. By the proposition's assumption, it
follows that $u \in \Phi(y, t+1)$, which proves the result. Q. E. D.

5.2  APPLICATIONS  TO  SRM  PROBLEMS 

Consider the SRM problem where there are $n$ targets and we can only observe one target at a
time. In the simplest case, the decision $U(t)$ to make at each time $t$ is only which target
$i = 1, \ldots, n$ to observe. There is a Markov chain $X_i(t)$ corresponding to each target $i$, where
$X_i(t)$ represents the information state of target $i$ at time $t$. Typically, we assume that the chains
$X_i(t)$ are independent and identically distributed, and that the selected (i.e., observed) chain
transitions using $p(\xi \mid x)$ and the $n-1$ unobserved chains transition using $q(\xi \mid x)$. Moreover,
the reward is typically additive over the $n$ targets, namely

$$R(x) = \sum_{i=1}^{n} r(x_i). \tag{104}$$

The resulting MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ has special structure where

$$\begin{aligned}
&\mathcal{X} = X^n \text{ and } X \text{ is the state space of one Markov chain } X_i, \\
&\mathcal{U} = \{1, \ldots, n\}, \\
&p_i(\xi \mid x) = p(\xi_i \mid x_i) \prod_{j \ne i} q(\xi_j \mid x_j) \quad \text{for } i \in \mathcal{U},\ x, \xi \in X^n, \\
&R(x) = \sum_{i=1}^{n} r(x_i) \quad \text{for } x \in X^n.
\end{aligned} \tag{105}$$

Remark 3. If $s = |X|$ is the number of states for each single Markov chain, then the
computational complexity of the dynamic programming solution is $O(n s^{2n} T)$. Thus, for fixed $s$
and $T$, the complexity is exponential in $n$. Furthermore, the memory requirements are
exponential, namely $O(s^n T)$. In some cases we can find an optimal strategy of the form
$U(t) \in \Phi((X_1(t), \ldots, X_n(t)), t)$ where

$$\Phi(x,t) = \Big\{ i : M_i(x_i, t) = \max_{j} M_j(x_j, t) \Big\}. \tag{106}$$

This is what we call a priority index rule strategy. The $M_i(x_i, t)$ are indices that can be computed
for each target with complexity $O(s^2 T)$ (i.e., equivalent to solving the dynamic program for one
target). Thus, the complexity of the $n$ target strategy is $O(n s^2 T)$ rather than $O(n s^{2n} T)$, linear in
$n$ rather than exponential in $n$.
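
A priority index rule of the form (106) is simple to state in code, and the table sizes quoted in Remark 3 are easy to compare numerically. The sketch below is illustrative only: the index function passed in is a placeholder (the report does not define a specific $M_i$ at this point), targets are numbered from 0 rather than 1, and the function names are hypothetical.

```python
def priority_index_set(x, t, index):
    """Return the set in (106): the targets whose index attains the maximum.

    x is a tuple of per-target states; index(i, x_i, t) plays the role of M_i(x_i, t).
    Targets are numbered from 0 here.
    """
    values = [index(i, xi, t) for i, xi in enumerate(x)]
    best = max(values)
    return [i for i, v in enumerate(values) if v == best]


def joint_table_entries(s, n, T):
    """Value-table size for the joint dynamic program, of order s**n * T."""
    return (s ** n) * T


def per_target_table_entries(s, n, T):
    """Total size of n single-target tables, of order n * s * T."""
    return n * s * T


# Placeholder index (an assumption for illustration): prefer the smallest state.
def toy_index(i, xi, t):
    return -xi


print(priority_index_set((3, 1, 2), t=0, index=toy_index))
print(joint_table_entries(s=10, n=5, T=20), per_target_table_entries(s=10, n=5, T=20))
```

Even for five targets with ten states each, the joint table is three orders of magnitude larger than the per-target tables, which is the practical motivation for index rules.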

For the class of transition probabilities $p_i(\xi \mid x)$ with structure (105), commutativity is
equivalent to the commutativity of the transition functions $p$ and $q$, as the following simple
result shows.

Proposition 2. If the transition probability functions $p_i(\xi \mid x)$ defined for $\xi, x \in X^n = \mathcal{X}$ and
$i \in \{1, \ldots, n\}$ satisfy

$$p_i(\xi \mid x) = p(\xi_i \mid x_i) \prod_{j \ne i} q(\xi_j \mid x_j), \tag{107}$$

and if for all $\xi_1, x_1 \in X$,

$$\sum_{\eta_1} p(\xi_1 \mid \eta_1)\, q(\eta_1 \mid x_1) = \sum_{\eta_1} q(\xi_1 \mid \eta_1)\, p(\eta_1 \mid x_1), \tag{108}$$

then $p_i(\xi \mid x)$ are commutative transition probability functions for $\xi, x \in X^n = \mathcal{X}$.


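Proof. Note that for $i \ne j$,

$$\begin{aligned}
\sum_{\eta} p_i(\xi \mid \eta)\, p_j(\eta \mid x)
&= \sum_{\eta} p(\xi_i \mid \eta_i) \prod_{k \ne i} q(\xi_k \mid \eta_k)\; p(\eta_j \mid x_j) \prod_{k \ne j} q(\eta_k \mid x_k) \\
&= \sum_{\eta_j} q(\xi_j \mid \eta_j)\, p(\eta_j \mid x_j) \sum_{\eta_i} p(\xi_i \mid \eta_i)\, q(\eta_i \mid x_i) \prod_{k \ne i,j} \sum_{\eta_k} q(\xi_k \mid \eta_k)\, q(\eta_k \mid x_k).
\end{aligned} \tag{109}$$

By assumption,

$$\sum_{\eta_j} q(\xi_j \mid \eta_j)\, p(\eta_j \mid x_j) = \sum_{\eta_j} p(\xi_j \mid \eta_j)\, q(\eta_j \mid x_j) \tag{110}$$

and

$$\sum_{\eta_i} p(\xi_i \mid \eta_i)\, q(\eta_i \mid x_i) = \sum_{\eta_i} q(\xi_i \mid \eta_i)\, p(\eta_i \mid x_i). \tag{111}$$

Thus,

$$\begin{aligned}
&\sum_{\eta_j} q(\xi_j \mid \eta_j)\, p(\eta_j \mid x_j) \sum_{\eta_i} p(\xi_i \mid \eta_i)\, q(\eta_i \mid x_i) \prod_{k \ne i,j} \sum_{\eta_k} q(\xi_k \mid \eta_k)\, q(\eta_k \mid x_k) \\
&\quad= \sum_{\eta_j} p(\xi_j \mid \eta_j)\, q(\eta_j \mid x_j) \sum_{\eta_i} q(\xi_i \mid \eta_i)\, p(\eta_i \mid x_i) \prod_{k \ne i,j} \sum_{\eta_k} q(\xi_k \mid \eta_k)\, q(\eta_k \mid x_k) \\
&\quad= \sum_{\eta} p_j(\xi \mid \eta)\, p_i(\eta \mid x),
\end{aligned} \tag{112}$$

proving that

$$\sum_{\eta} p_i(\xi \mid \eta)\, p_j(\eta \mid x) = \sum_{\eta} p_j(\xi \mid \eta)\, p_i(\eta \mid x). \tag{113}$$

Q. E. D.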
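
Proposition 2 can be checked numerically on a small instance. The sketch below builds the joint transition matrices of the product form (107) for two targets and compares the two orders of composition. Matrices are stored row-wise, so entry `[x][xi]` holds the probability of moving from `x` to `xi`; the particular matrices `p`, `q_good`, and `q_bad`, and all function names, are assumptions chosen for illustration.

```python
from itertools import product


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)] for i in range(m)]


def commutes(a, b, tol=1e-12):
    """True if a*b and b*a agree entrywise within tol."""
    ab, ba = matmul(a, b), matmul(b, a)
    return all(abs(u - v) <= tol for ra, rb in zip(ab, ba) for u, v in zip(ra, rb))


def joint_matrix(i, p, q, n):
    """Joint transition matrix when target i is observed, following (107)."""
    s = len(p)
    joint_states = list(product(range(s), repeat=n))
    rows = []
    for x in joint_states:
        row = []
        for xi in joint_states:
            prob = 1.0
            for j in range(n):
                step = p if j == i else q   # observed chain uses p, the others use q
                prob *= step[x[j]][xi[j]]
            row.append(prob)
        rows.append(row)
    return rows


p = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]           # step down w.p. 0.5
q_good = [[1.0, 0.0, 0.0], [0.25, 0.75, 0.0], [0.0, 0.25, 0.75]]  # 0.5*I + 0.5*p
q_bad = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]       # drifts upward

for name, q in (("q_good", q_good), ("q_bad", q_bad)):
    single = commutes(p, q)
    joint = commutes(joint_matrix(0, p, q, 2), joint_matrix(1, p, q, 2))
    print(name, single, joint)
```

In this instance the joint transitions commute exactly when the single-chain transitions $p$ and $q$ do, in line with condition (108).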

Remark 4. Note that commutativity always holds if $p$ or $q$ is the identity transition
$\delta(\xi_i \mid x_i) = 1$ for $\xi_i = x_i$ and $0$ otherwise. Note that $q = \delta$ is assumed true in (non-restless)
multi-armed bandit problems. Also, classification sensor management problems often satisfy
$q = \delta$ (i.e., the classification information state remains unchanged while the target is
unobserved). For this class of MDPs corresponding to SRM problems, the general result
(Theorem 1) reduces to the result of Corollary 1.

Remark 5. Transition probabilities of the form

$$p_i(\xi \mid x) = p(\xi_i \mid x_i) \prod_{j \ne i} q(\xi_j \mid x_j) \tag{114}$$

and reward functions

$$R(x) = \sum_{i=1}^{n} r(x_i) \tag{115}$$

are obviously symmetric.

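Corollary 1. Suppose that the MDP $(\mathcal{X}, \mathcal{U}, p_u, R, T)$ has special symmetric structure where

$$\begin{aligned}
&\mathcal{X} = X^n \text{ and } X \text{ is the state space of one Markov chain } X_i, \\
&\mathcal{U} = \{1, \ldots, n\}, \\
&p_i(\xi \mid x) = p(\xi_i \mid x_i) \prod_{j \ne i} q(\xi_j \mid x_j) \quad \text{for } i \in \mathcal{U},\ x, \xi \in X^n, \\
&R(x) = \sum_{i=1}^{n} r(x_i) \quad \text{for } x \in X^n.
\end{aligned} \tag{116}$$

Suppose that $\Phi(x,t)$ is a strategy set for $x = (x_1, \ldots, x_n)$ such that $i \in \Phi(x, T-1)$ implies

$$\sum_{y_i} r(y_i)\, p(y_i \mid x_i) - r(x_i) \ge \sum_{y_j} r(y_j)\, p(y_j \mid x_j) - r(x_j) \tag{117}$$

for all $j$, and

$$i \in \Phi(x_1, \ldots, x_i, \ldots, x_j, \ldots, x_n, t),\quad x_i \ne x_j,\quad \text{and}\quad p(y_j \mid x_j) > 0 \tag{118}$$

implies that
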
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(119) 


Then  the  strategy  set  ®  is  optimal. 

Proof. The condition on $p_i(\xi \mid x)$ implies that it is commutative. The second condition implies
that $\Phi$ is terminally optimal for the terminal reward R(x), and the third condition implies that
decisions in $\Phi$ are deferrable (Proposition 1). The result follows from Theorem 1. Q. E. D.


5.3  BIRTH-DEATH  MDPS 

One class of MDPs for which the sufficient conditions hold is analogous to a birth-death process
that evolves on an ordered set and can only transition one step at a time. Specifically, suppose
that each individual Markov chain $X_i(t)$ has states in the nonnegative integers 0, 1, ... and can
transition down at most one unit at a time (but can transition up any number of units at a time).
Thus, $X_i(t+1) \ge X_i(t) - 1$. If the Markov chain $X_i(t)$ is in state 0, it stays there (so that 0 is a
trapping state). The terminal reward gives reward 1 if the state is 0 and gives reward 0 for any
other state. The objective of the MDP is to control as many of the individual Markov chains as
possible into state 0 by the terminal time T. In this case, the sufficient condition shows that the
greedy solution is optimal; that is, select next the chain i with the smallest nonzero state $X_i(t)$.


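Corollary 2. Suppose that the MDP $(\mathbf{X}, U, p_u, R, T)$ has the special structure

$$\begin{aligned}
&\mathbf{X} = N_0^n \text{ and } N_0 = \{0, 1, 2, \ldots\}, \\
&U = \{1, \ldots, n\}, \\
&p_i(\xi \mid x) = p(\xi_i \mid x_i) \prod_{j \ne i} q(\xi_j \mid x_j) \text{ for } i \in U,\ x, \xi \in N_0^n, \\
&R(x) = \sum_{i=1}^{n} r(x_i) \text{ for } x \in N_0^n,
\end{aligned} \tag{120}$$

where $p(y \mid x) = 0$ for $y < x - 1$, $p(0 \mid 0) = 1$, $r(x_i) = 0$ if $x_i \ne 0$, and $r(0) = 1$. For any x and
time-to-go $\tau > 0$, define

$$\Phi(x) = \Big\{ i : x_i = \min_{x_j > 0} x_j \Big\} \quad \text{if some } x_i > 0, \tag{121}$$

and

$$\Phi(x) = \{1, \ldots, n\} \quad \text{if } x_j = 0 \text{ for all } j. \tag{122}$$

Then any control $U(t) \in \Phi(x(t))$ is optimal.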

Proof. First check that $i \in \Phi(x)$ maximizes the marginal reward

$$\sum_{y} [r(y) - r(x_i)]\, p(y \mid x_i) = \begin{cases} 0 & \text{if } x_i > 1 \\ p(0 \mid 1) & \text{if } x_i = 1 \\ 0 & \text{if } x_i = 0. \end{cases} \tag{123}$$

Suppose $x_i > 0$ and $x_i \le x_j$. If $x_i = 1$, then i has marginal reward $p(0 \mid x_i)$ and j has the same
marginal reward $p(0 \mid x_j) = p(0 \mid x_i)$ if $x_i = x_j$, or smaller marginal reward 0 if $x_j > x_i$. If
$x_i \ge 2$, then i and j both have marginal reward 0. Likewise, if $x_j = 0$ for all j, then all j
have marginal reward 0, so that any selection is optimal. Thus, $\Phi$ is terminally optimal.

Suppose that $p(y_j \mid x_j) > 0$ for $x_j \ne x_i$. Then we will show that

$$i \in \Phi(x_1, \ldots, x_n) \text{ implies } i \in \Phi(x_1, \ldots, y_j, \ldots, x_n). \tag{124}$$

Suppose $i \in \Phi(x_1, \ldots, x_n)$. Note that $x_i > 0$ because $x_i \ne x_j$ and there is at least one non-zero $x_k$.
Let $p(y_j \mid x_j) > 0$. If $x_j = 0$, then $y_j = 0$ and none of the $x_k$ change values, and therefore

$$i \in \Phi(x_1, \ldots, y_j, \ldots, x_n) = \Phi(x_1, \ldots, x_j, \ldots, x_n). \tag{125}$$

If $x_j > 0$, then $x_j > x_i$ since $x_j \ne x_i$. Thus, $p(y_j \mid x_j) > 0$ implies that $y_j \ge x_j - 1$ and $y_j \ge x_i$.
Although $y_j$ is non-zero, it is no smaller than $x_i$, and $x_i$ is still the minimum of the non-zero
elements of $x_1, \ldots, y_j, \ldots, x_n$. Therefore

$$i \in \Phi(x_1, \ldots, y_j, \ldots, x_n). \tag{126}$$

It follows that decisions in $\Phi$ are deferrable (Corollary 1), and the proof is complete. Q. E. D.
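
The claim of Corollary 2 can be checked numerically on a small instance. The following is a minimal sketch, not part of the original analysis: it assumes a hypothetical single-chain kernel `P` on the capped state set {0, 1, 2, 3}, takes $q = \delta$ (unobserved chains do not move), and compares the value of the greedy rule with the value obtained by exhaustive dynamic programming. All names (`P`, `greedy_set`, `max_gap`) are illustrative.

```python
from functools import lru_cache
from itertools import product

# Hypothetical single-chain kernel P[x][y] = p(y | x) on the capped state set {0,...,3}.
# It satisfies the assumptions of Corollary 2: p(0|0) = 1 and p(y|x) = 0 for y < x - 1.
P = {
    0: {0: 1.0},
    1: {0: 0.4, 1: 0.3, 3: 0.3},
    2: {1: 0.5, 2: 0.2, 3: 0.3},
    3: {2: 0.6, 3: 0.4},
}

def terminal_reward(x):
    """R(x): number of chains sitting in the trapping state 0."""
    return float(sum(1 for xi in x if xi == 0))

def greedy_set(x):
    """Phi(x): indices attaining the smallest nonzero state (all indices if x = 0)."""
    nonzero = [xi for xi in x if xi > 0]
    if not nonzero:
        return list(range(len(x)))
    smallest = min(nonzero)
    return [i for i, xi in enumerate(x) if xi == smallest]

def expected_value(x, i, steps, value_fn):
    """Expected value of observing chain i now; unobserved chains stay put (q = delta)."""
    return sum(prob * value_fn(x[:i] + (y,) + x[i + 1:], steps - 1)
               for y, prob in P[x[i]].items())

@lru_cache(maxsize=None)
def optimal_value(x, steps):
    """Exhaustive dynamic programming over all actions."""
    if steps == 0:
        return terminal_reward(x)
    return max(expected_value(x, i, steps, optimal_value) for i in range(len(x)))

@lru_cache(maxsize=None)
def greedy_value(x, steps):
    """Value of always selecting the first element of Phi(x)."""
    if steps == 0:
        return terminal_reward(x)
    return expected_value(x, greedy_set(x)[0], steps, greedy_value)

def max_gap(n_chains=3, horizon=4):
    """Largest gap between optimal and greedy value over all small instances."""
    return max(abs(optimal_value(x, t) - greedy_value(x, t))
               for x in product(range(4), repeat=n_chains)
               for t in range(horizon + 1))

print(max_gap())
```

If the corollary applies to this instance, the printed gap should be zero up to floating-point rounding. Replacing `greedy_set` with a rule that selects the largest state would be expected to produce a positive gap.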

Example 1. As an example of the birth-death MDP we have defined, consider the four-state
model of tracking quality given by the Markov chains shown in the figure, which shows the possible
transitions for the cases in which target i is observed or not observed.

Figure 19: In Example 1, the state of the target remains unchanged if it is not observed, but
the state may change, as indicated in the illustration, if the target is observed.

Here, there are four states of tracking accuracy: undetected, detected, tracked, and
handover (e.g., to a weapon system). The model requires that the state can transition at most one
step to the right in any time interval, but could transition to the left all the way to undetected (i.e.,
drop track). If the track reaches the handover state, it stays there. The goal of the example is to
control as many targets as possible into the handover state by time T. The transition probabilities
reflect the probability of track error increase or decrease, depending on the type of measurements taken.
The optimal strategy is to look at the target whose state is closest to but not equal to
handover.

5.4  BINARY  CLASSIFICATION  PROBLEM 

This problem is to classify as many of n objects as possible over a finite time horizon T given binary
measurements of the objects. The problem is interesting because it is a partially observed
Markov decision process (POMDP) which can be interpreted as an MDP with a countable state
space. Suppose there are n random variables $Z_i$ with values 0, 1 and that $\Pr\{Z_i = 1\} = p$ for all
$i = 1, \ldots, n$. Suppose that the $Y_i(t)$ are 0, 1 observations of $Z_i$, and the $Y_i(t)$ are independent and
identically distributed conditioned on $Z_i$ with

$$\Pr\{Y_i(t) = y \mid Z_i = z\} = (1 - \varepsilon)\, \delta_{y,z} + \varepsilon\, (1 - \delta_{y,z}), \tag{127}$$

where we use the notation $\delta_{y,z} = 1$ if $y = z$ and 0 otherwise. We assume that $\varepsilon < 1/2$. Note that $\varepsilon$
is the probability of classification error for one measurement.

Define the conditional probability $X_i(t)$ as

$$X_i(t) = \Pr\{Z_i = 1 \mid Y_i(1), \ldots, Y_i(t)\}. \tag{128}$$

The objective of the problem is to maximize the expected reward

$$E\Big\{ \sum_{i=1}^{n} r(X_i(T)) \Big\} \tag{129}$$

at the terminal time T, where $r(x_i)$ is the individual reward

$$r(x_i) = \max_{d_i = 0, 1} \{ r(d_i, 1)\, x_i + r(d_i, 0)(1 - x_i) \} \tag{130}$$

and $r(d, z)$ are the rewards for the different types of outcomes (i.e., deciding $d_i$ when the true
state of i is $z_i$).

The processes $X_i(t)$ satisfy

$$X_i(0) = p \tag{131}$$

and for $t \ge 0$,

$$X_i(t+1) = \begin{cases} \dfrac{(1 - \varepsilon) X_i(t)}{(1 - 2\varepsilon) X_i(t) + \varepsilon} & \text{with probability } (1 - 2\varepsilon) X_i(t) + \varepsilon \\[2ex] \dfrac{\varepsilon X_i(t)}{(2\varepsilon - 1) X_i(t) + 1 - \varepsilon} & \text{with probability } (2\varepsilon - 1) X_i(t) + 1 - \varepsilon. \end{cases} \tag{132}$$
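
The recursion in Equation (132) can be sketched in a few lines. This is an illustrative sketch only: the function names `update` and `lattice` and the value $\varepsilon = 1/5$ are assumptions, and exact rational arithmetic is used so that equalities can be checked without rounding. The helper `lattice` encodes the closed form that appears later in this section as Equation (136), which holds for the prior $p = 1/2$.

```python
from fractions import Fraction

def lattice(m, eps):
    """State reached from prior 1/2 when outcomes 1 exceed outcomes 0 by m (Equation (136))."""
    ratio = eps / (1 - eps)
    return 1 / (1 + ratio ** m)

def update(x, eps):
    """One observation of an object whose current state is x, following Equation (132).

    Returns [(state after observing 1, probability), (state after observing 0, probability)].
    """
    p_one = (1 - 2 * eps) * x + eps           # probability of observing 1
    p_zero = (2 * eps - 1) * x + 1 - eps      # probability of observing 0
    return [((1 - eps) * x / p_one, p_one), (eps * x / p_zero, p_zero)]

eps = Fraction(1, 5)        # illustrative error probability, eps < 1/2
x = lattice(0, eps)         # the prior 1/2
for state, prob in update(x, eps):
    print(state, prob)
```

Under these assumptions, each observation moves the state one step along the lattice, and the expected next state equals the current state, as expected of a conditional probability.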


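Note that although the $X_i(t)$ take values in R, there are only a countable number of possible values
they can take. Thus, $X_i(t) \in X \subset R$ where X is a countable set. Thus, we have an MDP
$(\mathbf{X}, U, p_u, R, T)$ where

$$\begin{aligned}
&\mathbf{X} = X^n, \\
&U = \{1, \ldots, n\}, \\
&p_i(\xi \mid x) = p(\xi_i \mid x_i) \prod_{j \ne i} \delta(\xi_j \mid x_j) \text{ for } i \in U,\ x, \xi \in X^n, \\
&R(x) = \sum_{i=1}^{n} r(x_i) \text{ for } x \in X^n,
\end{aligned} \tag{133}$$

where $p(\xi_i \mid x_i)$ is defined by

$$\begin{aligned}
p\Big( \frac{(1 - \varepsilon) x_i}{(1 - 2\varepsilon) x_i + \varepsilon} \,\Big|\, x_i \Big) &= (1 - 2\varepsilon) x_i + \varepsilon, \\
p\Big( \frac{\varepsilon x_i}{(2\varepsilon - 1) x_i + 1 - \varepsilon} \,\Big|\, x_i \Big) &= (2\varepsilon - 1) x_i + 1 - \varepsilon,
\end{aligned} \tag{134}$$

and $r(x_i)$ is defined by Equation (130). We will consider the special case for which
$r(1,1) = r(0,0) = 1$ and $r(0,1) = r(1,0) = 0$ so that

$$r(x_i) = \tfrac{1}{2} + \big| x_i - \tfrac{1}{2} \big| \tag{135}$$

and we will assume that the prior probability $p = 1/2$. Note that if $p = 1/2$, then

$$X = \Big\{ \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^m} : m = 0, \pm 1, \pm 2, \ldots \Big\}. \tag{136}$$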


Proposition 3. The strategy set $\Phi$ defined by

$$\Phi(x) = \Big\{ i : \big| x_i - \tfrac{1}{2} \big| = \min_j \big| x_j - \tfrac{1}{2} \big| \Big\} \tag{137}$$

is optimal for the binary classification problem with $r(1,1) = r(0,0) = 1$, $r(0,1) = r(1,0) = 0$, and
prior probability $p = 1/2$ for each object i.

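Proof. The transition probabilities $p_i(\xi \mid x)$ are obviously commutative and symmetric, and the
reward function R(x) is obviously symmetric. Note that

$$\begin{aligned}
\sum_{y_i} [r(y_i) - r(x_i)]\, p(y_i \mid x_i) = {} & -\Big( \tfrac{1}{2} + \big| x_i - \tfrac{1}{2} \big| \Big) \\
& + \Big( \tfrac{1}{2} + \Big| \frac{(1 - \varepsilon) x_i}{(1 - 2\varepsilon) x_i + \varepsilon} - \tfrac{1}{2} \Big| \Big) \big( (1 - 2\varepsilon) x_i + \varepsilon \big) \\
& + \Big( \tfrac{1}{2} + \Big| \frac{\varepsilon x_i}{(2\varepsilon - 1) x_i + 1 - \varepsilon} - \tfrac{1}{2} \Big| \Big) \big( (2\varepsilon - 1) x_i + 1 - \varepsilon \big).
\end{aligned} \tag{138}$$

This simplifies to

$$\sum_{y_i} [r(y_i) - r(x_i)]\, p(y_i \mid x_i) = -\big| x_i - \tfrac{1}{2} \big| + \tfrac{1}{2} | x_i - \varepsilon | + \tfrac{1}{2} | (1 - x_i) - \varepsilon |, \tag{139}$$

which is equivalent to

$$\sum_{y_i} [r(y_i) - r(x_i)]\, p(y_i \mid x_i) = \begin{cases} 0 & \text{for } 0 \le x_i \le \varepsilon \\ x_i - \varepsilon & \text{for } \varepsilon \le x_i \le \tfrac{1}{2} \\ 1 - x_i - \varepsilon & \text{for } \tfrac{1}{2} \le x_i \le 1 - \varepsilon \\ 0 & \text{for } 1 - \varepsilon \le x_i \le 1. \end{cases} \tag{140}$$

Note that because

$$x_i = \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^m}, \tag{141}$$

if $m < 0$, then

$$x_i \le \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^{-1}} = \varepsilon, \tag{142}$$

and if $m > 0$, then

$$x_i \ge \frac{1}{1 + \frac{\varepsilon}{1 - \varepsilon}} = 1 - \varepsilon. \tag{143}$$

Thus,

$$\sum_{y_i} [r(y_i) - r(x_i)]\, p(y_i \mid x_i) = \begin{cases} 0 & \text{for } x_i \ne \tfrac{1}{2} \\ \tfrac{1}{2} - \varepsilon & \text{for } x_i = \tfrac{1}{2}. \end{cases} \tag{144}$$

In particular,

$$\max_j \sum_{y_j} [r(y_j) - r(x_j)]\, p(y_j \mid x_j) = \begin{cases} 0 & \text{if all } x_j \ne \tfrac{1}{2} \\ \tfrac{1}{2} - \varepsilon & \text{if some } x_j = \tfrac{1}{2}, \end{cases} \tag{145}$$

and if

$$\big| x_i - \tfrac{1}{2} \big| = \min_j \big| x_j - \tfrac{1}{2} \big|, \tag{146}$$

then

$$\sum_{y_i} [r(y_i) - r(x_i)]\, p(y_i \mid x_i) = \max_j \sum_{y_j} [r(y_j) - r(x_j)]\, p(y_j \mid x_j). \tag{147}$$

This shows that $\Phi$ is terminally optimal.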
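
The simplification of the marginal reward can be verified with exact arithmetic. The sketch below is illustrative rather than part of the original proof: the names `r`, `marginal`, and `lattice` and the value $\varepsilon = 1/5$ are assumptions. It evaluates the one-step expected gain directly from the two-outcome update and compares it with the absolute-value form of Equation (139) and with the lattice values used in the proof.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def r(x):
    """Individual reward r(x) = 1/2 + |x - 1/2| (Equation (135))."""
    return HALF + abs(x - HALF)

def lattice(m, eps):
    """Reachable state for net outcome count m, starting from the prior 1/2."""
    return 1 / (1 + (eps / (1 - eps)) ** m)

def marginal(x, eps):
    """Expected one-step gain sum_y [r(y) - r(x)] p(y | x), computed directly."""
    p_one = (1 - 2 * eps) * x + eps
    p_zero = 1 - p_one
    y_one = (1 - eps) * x / p_one
    y_zero = eps * x / p_zero
    return (r(y_one) - r(x)) * p_one + (r(y_zero) - r(x)) * p_zero

def simplified(x, eps):
    """Right-hand side of Equation (139)."""
    return -abs(x - HALF) + HALF * abs(x - eps) + HALF * abs((1 - x) - eps)

eps = Fraction(1, 5)    # illustrative error probability
for m in range(-2, 3):
    print(m, marginal(lattice(m, eps), eps))
```

With these assumptions, the gain is nonzero only at the state 1/2 (that is, m = 0), which is why an object at 1/2 is always a terminally optimal choice.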

To show that decisions in $\Phi$ can be deferred, suppose that $i \in \Phi(x)$ so that

$$\big| x_i - \tfrac{1}{2} \big| = \min_k \big| x_k - \tfrac{1}{2} \big| \tag{148}$$

and suppose $x_j \ne x_i$, $p(y_j \mid x_j) > 0$. Thus,

$$y_j = \frac{(1 - \varepsilon) x_j}{(1 - 2\varepsilon) x_j + \varepsilon} \tag{149}$$

or

$$y_j = \frac{\varepsilon x_j}{(2\varepsilon - 1) x_j + 1 - \varepsilon}. \tag{150}$$

If $|x_j - \tfrac{1}{2}| > |x_i - \tfrac{1}{2}|$, then it is easy to see that $|y_j - \tfrac{1}{2}| \ge |x_i - \tfrac{1}{2}|$, and therefore
$i \in \Phi(x_1, \ldots, y_j, \ldots, x_n)$. However, it is possible that $|x_j - \tfrac{1}{2}| = |x_i - \tfrac{1}{2}|$ and $x_i \ne x_j$. In that
case the conclusion is not true, because one of the two values of $y_j$ is closer to 1/2 than $x_i$.

However, we can easily extend the proposition to cover this case. Note that if $|x_j - \tfrac{1}{2}| = |x_i - \tfrac{1}{2}|$
and $x_j \ne x_i$, then $x_j = 1 - x_i$. The classification problem is invariant under the transformation
$x_i \to 1 - x_i$, and in particular,

$$r(1 - x_i) = r(x_i). \tag{151}$$

Furthermore, if $p(y_i \mid x_i) > 0$, then $p(1 - y_i \mid 1 - x_i) > 0$. It follows that

$$V(\ldots, x_i, \ldots, \tau) = V(\ldots, 1 - x_i, \ldots, \tau). \tag{152}$$

As a consequence of this and the symmetry of V, we find that $|x_j - \tfrac{1}{2}| = |x_i - \tfrac{1}{2}|$ implies that

$$V_i(x, \tau) = V_j(x, \tau). \tag{153}$$

Thus, $i, j \in \Phi(x)$ implies that $V_i(x, \tau) = V_j(x, \tau)$. This is sufficient to extend the proposition
because if $V_i(x, \tau) > V_j(x, \tau)$, then both $x_i \ne x_j$ and $|x_i - \tfrac{1}{2}| \ne |x_j - \tfrac{1}{2}|$. Thus, we can apply the
earlier argument to show that $i \in \Phi(x_1, \ldots, y_j, \ldots, x_n)$. Consequently, decisions are deferrable in $\Phi$.
From Theorem 1 it follows that $\Phi$ is optimal. Q. E. D.

5.5  SEARCH  PROBLEM 

Here, we investigate the applicability of the sufficient optimality condition to the search problem
considered in [7]. In particular, we consider the search problem where M locations are given and
each of them may contain an object of interest. We are given a finite time to search the locations, and
at any time we can search one location only. Initially, at time t = 0, for every location i, we are
given the a priori probability $x_i(0)$ that location i contains an object.

With each location i, we associate a hypothesis $H_i$, with $H_i = 1$ denoting that there is an object at
location i. With each location i, we also associate a state $x_i(t)$ defined as the probability that an
object is in location i, given the past measurement collection I(t). In other words, $x_i(t)$ is the
conditional probability that hypothesis $H_i$ is true:

$$x_i(t) = P\{H_i = 1 \mid I(t)\}. \tag{154}$$




We  consider  the  independent  hypothesis  assumption,  where  the  events  that  the  hypotheses  are 
true are independent. Under this assumption, the search problem is a multi-armed bandit problem.
The  sufficient  optimality  condition  given  in  Section  5.1  applies  to  a  sub-class  of  multi-armed 
bandit  problems.  Note,  however,  that  this  optimality  condition  does  not  apply  to  the  search 
problem  considered  in  [7]  with  the  exclusive  hypothesis  assumption.  The  reason  is  that,  under 
the  exclusive  hypothesis  assumption,  the  states  of  all  of  the  locations  are  changing  after  each 
search,  thus  violating  the  basic  property  of  multi-armed  bandit  problems. 


Searching a location results in a measurement Z taking values 0 or 1. The value of the
measurement is generated independently at each stage as a random variable with the following
probability distribution:

$$\begin{aligned}
P\{Z = 0 \mid H_i = 0\} &= P\{Z = 1 \mid H_i = 1\} = 1 - \varepsilon, \\
P\{Z = 1 \mid H_i = 0\} &= P\{Z = 0 \mid H_i = 1\} = \varepsilon.
\end{aligned} \tag{155}$$

Let $x(t) = (x_1(t), \ldots, x_M(t))$ be the state of the locations at time t, u(t) be the location searched at time t,
and $Z(t+1) = z_k$ be the measurement obtained. Then, the new state of the locations is given by the vector

$$x(t+1) = (x_1(t), \ldots, f_i(x_i(t), z_k), \ldots, x_M(t)) \text{ with } i = u(t), \tag{156}$$

where $f_i(x_i(t), z_k)$ is a Bayesian update of $x_i(t)$ with the new measurement $z_k$:

$$f_i(x_i(t), z_k) = \frac{x_i(t)\, g(z_k)}{p(z_k \mid i, I(t))}, \tag{157}$$

and

$$\begin{aligned}
p(z_k \mid i, I(t)) &= P\{Z(t+1) = z_k \mid u(t) = i, I(t)\} \\
&= x_i(t)\, g(z_k) + [1 - x_i(t)]\, f(z_k).
\end{aligned} \tag{158}$$

Here $g(z_k)$ and $f(z_k)$ denote the probabilities of the measurement $z_k$ under $H_i = 1$ and $H_i = 0$,
respectively, as given in Equation (155).

The probability of transitioning from state x(t) to state $(x_1(t), \ldots, f_i(x_i(t), z_k), \ldots, x_M(t))$ is equal to
$p(z_k \mid i, I(t))$, which does not depend on time.

Let X denote the set of all states reachable from the initial state x(0) during the given finite
horizon, i.e., the set of all x such that x results from the initial state and some information I that
can be collected within the given time horizon.

For the search problem considered in [7], all stage rewards are zero, except for the final-stage
reward, which is equal to the probability of the most likely location. In particular,

$$R(x, \tau) = 0 \text{ for all } x \in X \text{ and } \tau > 0, \tag{159}$$

$$R(x, 0) = \max_k x_k. \tag{160}$$

The goal is to find a search strategy that maximizes the final probability of selecting a correct
hypothesis, i.e., a strategy $\gamma$ maximizing the expected final-stage reward

$$E_\gamma \max_i x_i(T). \tag{161}$$

The structure of the search problem is identical to the classification example considered in
Section 5.4. The only difference is that these two problems have different final-stage rewards. In
[7], it is shown that selecting one of the two most likely locations is an optimal search strategy.
We will here show this result by using the sufficient optimality condition of Section 5.1.

Let $V(x, \tau)$ be the optimal reward-to-go $\tau$ stages from state x, and define functions $V_j$, $j = 1, \ldots, M$, as
follows (see $V_u$ in Equation (75) of Section 5.1):

$$\begin{aligned}
V_j(x, \tau) &= E\, V(x_1, \ldots, f_j(x_j, Z), \ldots, x_M, \tau - 1) \\
&= \sum_{z_k} V(x_1, \ldots, f_j(x_j, z_k), \ldots, x_M, \tau - 1)\, p(z_k \mid x_j, j)
\end{aligned} \tag{162}$$

for all x, $\tau > 0$, and all j.

For any state x and any stage $\tau$, we define the function $\Phi(x, \tau)$ to be the index set of the two most
likely locations. In particular, $\Phi(x, \tau)$ is given by

$$\Phi(x, \tau) = \{ i \mid x_i = [S(x)]_1 \text{ or } x_i = [S(x)]_2 \} \text{ for all } x \text{ and } \tau \ge 0, \tag{163}$$

where $S(\cdot)$ is a nonlinear operator from $R^M$ to $R^M$ that maps a vector x into a sorted vector, i.e.,

$$S(x) = (x_{s(1)}, \ldots, x_{s(M)}) \text{ with } x_{s(1)} \ge \cdots \ge x_{s(M)}, \tag{164}$$

and $[S(x)]_j$ denotes the j-th component of the vector S(x).

We next prove that $\Phi$ defines an optimal policy in the sense that for all $x \in X$ and $\tau > 0$,

$$V(x, \tau) = V_i(x, \tau - 1) \text{ for any } i \in \Phi(x, \tau - 1). \tag{165}$$

In particular, we show that $\Phi$ satisfies the sufficient optimality conditions given in Proposition 1
of Section 5.1.

Proposition: Assume that the initial state is x(0) = (1/2, ..., 1/2). Then, the function $\Phi(x, \tau)$ has the
following properties:

1. For all $x \in X$, $\tau = 0$, and any $i \in \Phi(x, 0)$, we have

$$V_i(x, 0) = \max_j V_j(x, 0), \tag{165}$$

where $V_j$ is as defined in (162).

2. For all $x \in X$, $\tau > 0$, $i \in \Phi(x, \tau)$, and all j such that $x_j \ne x_i$ and $p(z_k \mid x_j, u = j) > 0$, we have

$$i \in \Phi(x_1, \ldots, f_j(x_j, z_k), \ldots, x_M, \tau - 1). \tag{166}$$

Proof: We start by showing that relation (165) holds. The proof is based on the same line of
argument as the proof of Proposition 4 in [7]. Note that the symmetry assumption in [7] is satisfied
[cf. Equation (154)].


Let $x \in X$ be arbitrary, and let I be an information state such that x results from the initial state
x(0) and information I. By using the definition of $V_j$ [cf. Equation (162)], we have

$$\begin{aligned}
V_j(x, 0) &= \sum_{z_k} R(x_1, \ldots, f_j(x_j, z_k), \ldots, x_M, 0)\, p(z_k \mid j, I) \\
&= \sum_{z_k} \max\{x_1, \ldots, f_j(x_j, z_k), \ldots, x_M\}\, p(z_k \mid j, I).
\end{aligned} \tag{167}$$

Since the reward-to-go $V(x, \tau)$ is invariant under permutations of x [i.e., $V(Px, \tau) = V(x, \tau)$ for any
permutation matrix P], without loss of generality, we may assume that

$$x_1 \ge x_2 \ge \cdots \ge x_M. \tag{168}$$

Then, we have

$$\max\{x_1, \ldots, f_j(x_j, z_k), \ldots, x_M\} = \max\{f_1(x_1, z_k), x_2\} \quad \text{for } j = 1 \text{ and all } k, \tag{169}$$

$$\max\{x_1, \ldots, f_j(x_j, z_k), \ldots, x_M\} = \max\{x_1, f_j(x_j, z_k)\} \quad \text{for } j > 1 \text{ and all } k. \tag{170}$$

Thus, for j = 1,

$$\begin{aligned}
V_1(x, 0) &= \sum_{z_k} \max\{f_1(x_1, z_k), x_2\}\, p(z_k \mid 1, I) \\
&= \sum_{z_k} \max\Big\{ \frac{x_1 g(z_k)}{p(z_k \mid 1, I)}, x_2 \Big\}\, p(z_k \mid 1, I) \\
&= \sum_{z_k} \max\{x_1 g(z_k),\ x_2\, p(z_k \mid 1, I)\}.
\end{aligned} \tag{171}$$

By using the definition of $p(z_k \mid j, I)$ in Equation (158), we see that

$$V_1(x, 0) = \sum_{z_k} \max\{x_1 g(z_k),\ x_2 x_1 g(z_k) + x_2 (1 - x_1) f(z_k)\}. \tag{172}$$

Similar to the preceding, for $j \ge 2$, we obtain

$$\begin{aligned}
V_j(x, 0) &= \sum_{z_k} \max\{x_1, f_j(x_j, z_k)\}\, p(z_k \mid j, I) \\
&= \sum_{z_k} \max\Big\{ x_1, \frac{x_j g(z_k)}{p(z_k \mid j, I)} \Big\}\, p(z_k \mid j, I) \\
&= \sum_{z_k} \max\{x_1\, p(z_k \mid j, I),\ x_j g(z_k)\} \\
&= \sum_{z_k} \max\{x_1 x_j g(z_k) + x_1 (1 - x_j) f(z_k),\ x_j g(z_k)\}.
\end{aligned} \tag{173}$$

By taking the terms $x_1 x_j f(z_k)$ into a separate summation, we have

$$V_j(x, 0) = \sum_{z_k} \max\{x_1 x_j g(z_k) + x_1 f(z_k),\ x_j g(z_k) + x_1 x_j f(z_k)\} - x_1 x_j \sum_{z_k} f(z_k). \tag{174}$$

By using the relation

$$\sum_{z_k} f(z_k) = 1 = \sum_{z_k} g(z_k), \tag{175}$$

we obtain

$$\begin{aligned}
V_j(x, 0) &= \sum_{z_k} \max\{x_1 x_j g(z_k) + x_1 f(z_k),\ x_j g(z_k) + x_1 x_j f(z_k)\} - x_1 x_j \sum_{z_k} g(z_k) \\
&= \sum_{z_k} \max\{x_1 f(z_k),\ x_j (1 - x_1) g(z_k) + x_1 x_j f(z_k)\}.
\end{aligned} \tag{176}$$

Furthermore, since $x_j \le x_2$ for all $j > 2$, we see that

$$V_j(x, 0) \le V_2(x, 0) \text{ for all } j > 2. \tag{177}$$

We now prove that $V_2(x, 0) = V_1(x, 0)$. For j = 2, relation (176) gives

$$V_2(x, 0) = \sum_{z_k} \max\{x_1 f(z_k),\ x_2 (1 - x_1) g(z_k) + x_1 x_2 f(z_k)\}. \tag{178}$$

By changing the variables and by using the symmetry assumption on f and g, we obtain

$$\begin{aligned}
V_2(x, 0) &= \sum_{z_k} \max\{x_1 f(b - z_k),\ x_2 (1 - x_1) g(b - z_k) + x_1 x_2 f(b - z_k)\} \\
&= \sum_{z_k} \max\{x_1 g(z_k),\ x_2 (1 - x_1) f(z_k) + x_1 x_2 g(z_k)\}.
\end{aligned} \tag{179}$$

By comparing this with the expression for $V_1(x, 0)$ [cf. Equation (172)], we see that

$$V_2(x, 0) = V_1(x, 0). \tag{180}$$

Therefore,

$$\max_j V_j(x, 0) = V_1(x, 0) = V_2(x, 0). \tag{181}$$

According to the definition of $\Phi$, we have $\Phi(x, 0) = \{1, 2\}$, which together with the preceding
relation shows that the condition in Equation (165) holds.
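
The equality of $V_1(x, 0)$ and $V_2(x, 0)$ can be illustrated numerically. The sketch below is an assumption-laden example rather than the authors' code: it uses the symmetric binary measurement model of Equation (155) with a hypothetical $\varepsilon = 1/5$ and a hypothetical sorted state vector, and it evaluates Equation (167) directly with exact rational arithmetic. The name `one_step_value` is illustrative.

```python
from fractions import Fraction

def one_step_value(x, j, eps):
    """V_j(x, 0): expected largest posterior after one more search of location j."""
    g = {1: 1 - eps, 0: eps}        # P{Z = z | H = 1}, from Equation (155)
    f = {1: eps, 0: 1 - eps}        # P{Z = z | H = 0}, from Equation (155)
    value = 0
    for z in (0, 1):
        p_z = x[j] * g[z] + (1 - x[j]) * f[z]    # Equation (158)
        updated = list(x)
        updated[j] = x[j] * g[z] / p_z            # Equation (157)
        value += max(updated) * p_z               # summand of Equation (167)
    return value

eps = Fraction(1, 5)                                       # hypothetical error probability
x = [Fraction(7, 10), Fraction(3, 5), Fraction(3, 10)]     # hypothetical state, sorted descending
values = [one_step_value(x, j, eps) for j in range(len(x))]
print(values)
```

In this example, searching either of the two most likely locations gives the same expected final reward, and searching the third location gives a smaller one, which is consistent with Equations (177) and (181).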

We now show that $\Phi$ satisfies the relation in Equation (166). Again, without loss of generality, we
may assume that $x_1 \ge x_2 \ge \cdots \ge x_M$, so that $\Phi(x, \tau) = \{1, 2\}$. Then, for i = 1, any j such that $x_j \ne x_1$, and
any $z_k$ with $p(z_k \mid j, I) > 0$, we have

$$\text{either } f_j(x_j, z_k) \le x_1 \text{ or } f_j(x_j, z_k) > x_1. \tag{182}$$

Hence, $x_1$ is either the best or the second best, implying that

$$1 \in \Phi(x_1, \ldots, f_j(x_j, z_k), \ldots, x_M, \tau - 1). \tag{183}$$

Consider now the case where i = 2 and j = 1. Similar to the preceding, we have that

$$\text{either } f_1(x_1, z_k) \le x_2 \text{ or } f_1(x_1, z_k) > x_2, \tag{184}$$

implying that $x_2$ is either the best or the second best. Hence, in this case, we have

$$2 \in \Phi(x_1, \ldots, f_j(x_j, z_k), \ldots, x_M, \tau - 1), \tag{185}$$

and we are done.

Consider now the case where i = 2 and j > 2. We will show that for all j > 2 with $x_j \ne x_2$ and any $z_k$
with $p(z_k \mid j, I) > 0$, we have

$$x_2 \ge f_j(x_j, z_k). \tag{186}$$

As shown in Section 5.4, the components of x have the following form:

$$x_j = \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^m} \quad \text{for some } m \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}, \tag{187}$$

where m is the difference between the number of measurements of object j with outcome 1 and
the number of measurements of object j with outcome 0. Thus, for some integers $m_2$ and $m_j$, we
have

$$x_2 = \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^{m_2}}, \qquad x_j = \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^{m_j}}. \tag{188}$$

Furthermore, since $x_2 \ge x_j$ and $x_j \ne x_2$, from the preceding relation it follows that $m_2 > m_j$. If object j
is observed one more time and no measurement is obtained, then the state $x_j$ of the object does
not change, so $x_2$ is still the second best. If a measurement z is obtained, then the state of object j
is given by

$$f_j(x_j, z) = \frac{1}{1 + \big( \frac{\varepsilon}{1 - \varepsilon} \big)^{m_j \pm 1}},$$

and since $m_2$ and $m_j$ are integers with $m_2 > m_j$, we have $m_j + 1 \le m_2$ and hence $f_j(x_j, z) \le x_2$,
showing that $x_2$ is still the second best. Therefore, $x_2$ remains the second best for any observation
outcome, implying that

$$2 \in \Phi(x_1, \ldots, f_j(x_j, z_k), \ldots, x_M, \tau - 1), \tag{191}$$

thus showing that $\Phi$ satisfies the condition in Equation (166). Q. E. D.

Abstract

This  paper  studies  the  problem  of  dynamic  adaptive  scheduling  of  multi-mode  sensor  resources  for  the 
problem  of  classification  of  multiple  unknown  objects.  Sensor  schedules  are  adapted  based  on  the  observed 
data.  The  resulting  decision  problem  is  formulated  as  a  partially  observed  Markov  decision  problem  with  a 
large  state  space.  The  paper  describes  a  computable  lower  bound  on  the  achievable  performance  by  a  causal 
adaptive  schedule,  based  on  techniques  of  numerical  stochastic  control  and  combinatorial  optimization.  The 
lower  bound  is  based  on  an  expansion  of  the  admissible  control  space  of  the  dynamic  decision  problem,  leading 
to  a  problem  with  simpler  decision  structure  for  which  the  bounds  can  be  computed.  The  solution  of  the  relaxed 
problem  may  be  infeasible,  but  can  be  used  as  an  approximate  scheduling  technique  in  a  model  predictive  control 
framework.  The  bound  computations  are  illustrated  for  several  examples  involving  100  unknown  objects,  and 
compared  with  the  Monte  Carlo  performance  of  several  sensor  scheduling  algorithms. 


1  Introduction 

Many  modern  avionics  systems  include  multiple  sensors  as  well  as  individual  sensors  capable  of  focusing  on  different 
objects with different modes. To achieve as accurate a representation as possible of all objects of interest, it is
important  to  coordinate  the  allocation  and  scheduling  of  the  different  sensors  and  sensor  modes  across  the  different 
objects  of  interest.  The  various  modes  may  be  viewed  as  multiple  resources  to  be  managed,  and  the  measurement  of 
different  objects  under  specific  modes  may  be  viewed  as  tasks  to  be  performed  with  these  resources.  The  adaptive 
sensor  management  problem  consists  of  selecting  and  scheduling  the  sensor  modes  which  are  applied  to  objects  of 
interest,  integrating  the  collected  past  information  into  the  selection  of  future  sensing  actions. 

This  paper  develops  a  model  for  a  class  of  adaptive  sensor  management  problems  involving  the  goal  of  classifying 
a  known  number  of  objects  with  unknown  type,  given  a  fixed  number  of  sensor  resources,  where  the  sensor 
performance  parameters  are  time- invariant,  so  that  the  performance  parameters  associated  with  a  sensor  observing 
an  object  with  a  given  mode  do  not  depend  on  the  time  that  sensing  activity  occurs.  This  class  of  problems 
arises in several applications, including object classification in surveillance platforms such as Joint STARS, dynamic
search, and fault inspection and isolation in manufacturing systems. In these applications, inaccuracies in sensor
measurements  and  variations  in  object  characteristics  and  pose  imply  that  individual  measurements  provide  noisy 
estimates  of  object  type  whose  quality  depends  on  the  specific  mode  used  by  the  sensor.  In  situations  with  multiple 
objects  and  limited  resources,  this  noisy  information  can  be  used  to  prioritize  which  objects  to  look  at,  and  to 
assign  appropriate  sensor  modes  to  the  objects. 

Because  of  the  uncertain  nature  of  the  underlying  object  types  and  the  adaptive  nature  of  the  desired  schedules, 
dynamic  sensor  management  problems  can  be  formulated  as  partially  observed  Markov  decision  problems  (POMDP) 
[2,  1,  10,  11].  As  such,  this  class  of  problems  can  be  solved  using  stochastic  dynamic  programming  [3].  However,  for 
large numbers of objects, the required state space is very high-dimensional, consisting of the conditional probability
distributions of all of the objects. This leads to intractable computational problems, even with the fastest POMDP
algorithms.

Sensor  management  problems  have  been  formulated  previously  as  dynamic  optimization  problems  with  partial 
information.  The  extensive  literature  in  search  theory  [20]  deals  with  sensor  management  problems  involving 
objects  that  can  be  of  one  of  two  types  (hidden  or  found)  with  sensors  that  have  only  a  single  mode.  The  dynamic 
hypothesis  testing  problems  studied  in  [6]  also  have  objects  that  can  be  of  two  types  and  a  single  sensor  mode,  but 
generalize  results  in  search  theory  to  broader  classes  of  measurements.  More  recently,  there  has  been  work  [17] 
using  Markov  decision  problem  techniques  for  sensor  management,  particularly  techniques  based  on  the  solution  of 
multiarmed  bandit  problems.  However,  these  formulations  also  restrict  the  sensors  to  a  single  sensor  with  a  single 
mode,  and  require  an  infinite  horizon,  time-invariant  formulation. 

Because of the complexity of general sensor management (SM) algorithms with multiple sensors and modes, most practical SM
algorithms are heuristics based on information-theoretic metrics [5]. To date, there has been
no  effective  approach  that  can  characterize  the  achievable  SM  performance  to  determine  whether  such  heuristic 
algorithms  are  performing  well. 

In  this  paper,  we  consider  sensor  management  (SM)  problems  involving  multiple  distributed  sensors  with  multiple 
modes  per  sensor.  This  model  is  an  extension  of  the  model  discussed  in  [7].  We  show  that  the  resulting  POMDP 
models  admit  a  lower  bound  based  on  modifying  the  constraint  structure  to  expand  the  space  of  admissible 
strategies.  The  resulting  problem  becomes  a  dynamic  optimization  problem  subject  to  expected  value  constraints, 
a  class  of  problems  recently  studied  by  Chen  and  Blankenship  in  [24].  We  develop  a  hierarchical  algorithm  that 
exploits  the  structure  of  the  resulting  relaxed  problem.  This  hierarchical  algorithm  is  based  on  the  solution  of  single 
object  POMDP  problems,  coupled  with  nondifferentiable  optimization  techniques  based  on  Lagrangian  relaxation 
[16].  The  single  object  problems  are  of  small  dimension,  and  can  be  readily  solved  using  standard  algorithms  for 
POMDPs  [10,  11,  13].  The  hierarchical  algorithm  avoids  the  exponential  growth  of  the  dimensions  of  the  resulting 
state  space  in  the  POMDP  problem  as  a  function  of  the  number  of  objects. 

The  algorithm  used  to  compute  the  bounds  can  also  be  used  as  a  suboptimal  algorithm  for  real-time  sensor 
management.  Since  the  algorithm  solves  the  SM  problem  with  an  expanded  set  of  strategies,  it  is  possible  that 
the  resulting  SM  strategies  are  not  feasible.  This  requires  modification  of  the  problem  solution,  typically  using 
a receding horizon technique similar to model-predictive control. We describe one such approach in
this paper. The paper includes several examples where the lower bound performance is computed, and
compared  with  the  Monte  Carlo  performance  achieved  by  suboptimal  SM  algorithms. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  II  includes  the  mathematical  statement  of  the 
sensor  management  problem  for  a  single  sensor,  and  discusses  the  stochastic  dynamic  programming  algorithm  for 
this  problem.  Section  III  describes  the  modified  SM  formulation  and  the  derivation  of  the  lower  bound.  Section 
IV  describes  computation  approaches  for  evaluating  the  lower  bound.  Section  V  discusses  extensions  of  the  earlier 
results  to  multiple  sensors.  Section  VI  describes  the  numerical  experiments  and  results.  Section  VII  is  a  summary 
of  the  results  and  discussion  of  open  problems  of  interest. 

2  Problem  formulation 

In  this  section,  we  develop  a  formulation  of  the  adaptive  SM  problem  as  a  partially  observed  Markov  decision 
problem  (POMDP).  Assume  that  there  are  N  objects  of  interest  in  the  problem.  Each  object  can  belong  to 
one  and  only  one  of  K  different  classes,  and  the  object  identity  does  not  change  over  time.  Let  the  variable 
$x_i \in X = \{1, \ldots, K\}$ denote the true class of object i. We define the complete (but unknown) system state as:

$$x = (x_1\ x_2\ \cdots\ x_N) \tag{1}$$

Since the identities do not change over time, the complete system state is constant over time. We assume that
the $x_i$ are independent random variables with values in the finite space X. Associated with each object i is a prior
probability vector $\pi_i(0)$ which describes the probability distribution of the random variable $x_i$. That is,

$$\pi_{ij}(0) = \text{Prob}\{x_i = j\} \tag{2}$$

These  probability  distributions  represent  a  priori  knowledge  collected  on  each  object  before  the  start  of  the  SM 
problem. 




In order to obtain information about the state of each object, selected objects are examined with different modes
from different sensors. In order to simplify the notation in the exposition, we consider the case of a single sensor
with multiple modes $m \in \{1, \ldots, M\}$. We will highlight later the extensions required to incorporate multiple
sensors. The action to use a sensor mode m on object i produces an observable $y_m$ in a finite set $Y_m$, with a
conditional probability distribution that depends only on the object i, its type $x_i$, and the mode m, denoted by
$p(y_m \mid i, x_i, m)$. We assume that the observation outcomes of these sensing actions are conditionally independent of
each other given the object types.

We assume that obtaining a measurement of object $i$ with mode $m$ requires sensor resources $R_{im} > 0$ (e.g.,
duty  cycle  of  a  radar),  which  depend  on  the  specific  object  and  mode  selected.  For  the  SM  problem  of  interest, 
the  sensor  has  a  finite  amount  of  sensor  resources  R  that  can  be  used  for  measuring  objects.  The  objective  is  to 
classify,  with  minimal  error  cost,  the  objects  after  the  sensor  resource  R  is  exhausted.  This  formulation  is  stated 
more  rigorously  below. 

Without  loss  of  generality,  we  restrict  our  attention  to  SM  strategies  that  execute  only  one  action  at  a  time. 
Such  strategies  are  optimal  in  that  they  provide  maximal  information  for  adaptation,  and  will  achieve  minimal 
error cost. Let $u(k) = (i(k), m(k))$ denote the $(k+1)$-th action (starting at $k = 0$) taken by the sensor, consisting
of measuring object $i(k)$ with mode $m(k)$. Let $U$ denote the set of possible sensor actions, and let $y_{m(k)}(k)$ denote
the measured value resulting from action $u(k) \in U$. The past information available to adaptively select $u(k)$ is
$I(k) = \{u(0), y_{m(0)}(0), \ldots, u(k-1), y_{m(k-1)}(k-1)\}$. The SM problem decisions are selected adaptively until a
final random stopping instance $T$, selected based on the information $I(T)$. At this stopping instance,
the information $I(T)$ is available for estimating the object types. For each object $i$, there is a final decision $v_i \in X$
based on $I(T)$ that is selected to minimize the expected classification error.

An admissible adaptive SM policy is a set of measurable feedback strategies $\{\gamma(0), \ldots, \gamma(T)\}$ and stopping time
$T$ such that

$$\begin{aligned}
\gamma(k) &: I(k) \to U, \quad k < T \\
T &: I(T) \to \{\text{stop}, \text{continue}\} \\
\gamma(T) &: I(T) \to X^N
\end{aligned} \tag{3}$$

Let $\Gamma$ denote the set of all admissible SM policies. Since the observation space is finite and the decision space is
also finite, $\Gamma$ is a countable space.

Denote  by  c(v,  x)  the  cost  of  selecting  classification  decision  v  when  the  true  object  type  is  x.  The  SM  problem 
statement  is  to  minimize  the  expected  total  classification  cost 

$$J(\gamma) = E_\gamma\Big\{\sum_{i=1}^{N} c(v_i, x_i)\Big\} \tag{4}$$

over adaptive SM policies $\gamma \in \Gamma$ satisfying the resource utilization constraint

$$\sum_{k=0}^{T-1} R(u(k)) \le R \tag{5}$$

with the notation $R(u(k)) = R_{i(k)m(k)}$. Note that the constraint in (5) is a sample path constraint; for every
realization of the information sets $I(k)$, the adaptive policy $\gamma$ must not exceed the total sensor resources available.
Note also that, given the finite set of possible observation outcomes per mode $Y_m$ and the finite set of possible
decisions $u$, the number of possible information sets $I(k)$ after $k - 1$ actions is countable. This implies that there
is a finite number of possible admissible SM policies that satisfy the constraint (5).

The  above  problem  is  a  class  of  finite-state,  finite-observation  partially  observed  Markov  decision  problems 
studied  in  [2,  1,  11,  10,  3],  with  the  special  structure  that  the  underlying  state  dynamics  are  trivial,  and  the 
presence of the sample path constraints of (5). Such problems can be transformed into fully observed Markovian
decision problems in terms of a sufficient statistic: the conditional probability distribution of the state $x$ given
information $I(k)$, as follows. Let $S \subset \mathbb{R}^K$ denote the space of probability distributions on $X$, and let $S_N$ denote
the space of probability distributions on $X^N$. The conditional distribution vector for the composite state $x$ given
the information $I(k)$, $P(x \mid I(k)) \in S_N$, can be viewed as an information state, a sufficient statistic summarizing the
past observations. The recursive evolution of this information state in response to an action $u(k) = (i(k), m(k))$
can be described by Bayes' rule as

$$\begin{aligned}
P(x \mid I(k+1)) &= P(x \mid I(k), u(k), y_{m(k)}(k)) && (6) \\
&= \frac{P(y_{m(k)}(k) \mid x, I(k), u(k))\, P(x \mid I(k))}{P(y_{m(k)}(k) \mid I(k), u(k))} && (7) \\
&= \frac{P(y_{m(k)}(k) \mid x_{i(k)}, m(k))\, P(x \mid I(k))}{P(y_{m(k)}(k) \mid I(k), u(k))} && (8)
\end{aligned}$$

with  the  initial  condition 

$$P(x \mid I(0)) = \prod_{i=1}^{N} \pi_{i x_i}(0) \tag{9}$$

Under  the  previous  independence  assumptions,  the  following  lemma  establishes  a  convenient  representation: 


Lemma 2.1 Under the SM problem assumptions, the conditional probability factors as

$$P(x \mid I(k)) = \prod_{i=1}^{N} P(x_i \mid I(k)) \tag{10}$$

where the evolution of $P(x_i \mid I(k))$ under sensing action $u(k) = (i(k), m(k))$ and observed value $y_{m(k)}(k)$ is given by

$$P(x_i \mid I(k+1)) = \begin{cases}
P(x_i \mid I(k)) & \text{if } i(k) \ne i \\[6pt]
\dfrac{P(y_{m(k)}(k) \mid x_i, m(k))\, P(x_i \mid I(k))}{\sum_{j=1}^{K} P(y_{m(k)}(k) \mid x_i = j, m(k))\, P(x_i = j \mid I(k))} & \text{otherwise}
\end{cases} \tag{11}$$

The proof of this lemma is straightforward by induction, as the independence assumption of the object types $x_i$
guarantees the lemma is satisfied at $k = 0$, and (8) establishes the recursion. Note also that $P(x_i \mid I(k))$ depends
only on measurements in $I(k)$ corresponding to object $i$.
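The per-object update in (11) can be sketched in a few lines of Python. This is a minimal illustration, not code from the report: the function names `bayes_update` and `update_marginals`, the zero-based object index, and the numbers in the usage note are all assumptions introduced here.

```python
def bayes_update(pi_i, likelihood):
    """Posterior over the K classes of one object, as in the second case of (11).

    pi_i[j]       : current probability that the object is of class j+1
    likelihood[j] : P(y | x_i = j+1, m) for the observed value y and mode m
    """
    unnorm = [l * p for l, p in zip(likelihood, pi_i)]
    total = sum(unnorm)  # probability of the observation, P(y | I(k), u(k))
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnorm]


def update_marginals(marginals, i, likelihood):
    """Update only object i (zero-based, an assumption); other marginals are unchanged."""
    new_marginals = list(marginals)
    new_marginals[i] = bayes_update(marginals[i], likelihood)
    return new_marginals
```

For example, with two objects and K = 2, calling `update_marginals([[0.5, 0.5], [0.3, 0.7]], 0, [0.9, 0.1])` changes only the first marginal, which is the point of the product form in Lemma 2.1.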

The importance of Lemma 2.1 is that we can characterize the information state as a product of marginal
distributions, in $S^N$, as opposed to a joint distribution in $S_N$. As notation, define $\pi_i(k)$ to be the conditional
probability distribution of $x_i$ given information $I(k)$:

$$\pi_i(k) = P(x_i \mid I(k)) \tag{12}$$

The vector $\pi_i(k)$ has components $\pi_{ij}(k) = P(x_i = j \mid I(k))$. The results of Lemma 2.1 establish the following
representation for the conditional probability distribution of the entire state: $P(x \mid I(k))$ can be computed from
$\pi_i(k)$, $i = 1, \ldots, N$. Define the information vector


$$\pi(k) = (\pi_1(k), \pi_2(k), \ldots, \pi_N(k)) \tag{13}$$


For  a  given  observation  ym  using  mode  m  on  object  index  i,  define  the  observation  probability  matrix  as  the  K  x  K 
diagonal  matrix 

$$B_i(y_m) = \mathrm{diag}\{P(y_m \mid x_i = 1, m),\ P(y_m \mid x_i = 2, m),\ \ldots,\ P(y_m \mid x_i = K, m)\} \tag{14}$$

With this notation, we can define the evolution of the information vector in response to a measurement $y_m$ obtained
from a sensing action $(i, m)$ in terms of an evolution operator on $(S^N, U, Y)$ as

$$T(\pi, u = (i, m), y) = \begin{pmatrix} \pi_1 \\ \vdots \\ \pi_{i-1} \\ \dfrac{B_i(y)\pi_i}{e^T B_i(y)\pi_i} \\ \pi_{i+1} \\ \vdots \\ \pi_N \end{pmatrix} \tag{15}$$

where $e$ is a $K$-dimensional vector of all ones.

Conceptually,  the  SM  problem  described  above  can  be  solved  by  stochastic  dynamic  programming  [3].  The 
resource  constraint  in  (5)  can  be  incorporated  into  the  dynamics  to  obtain  a  dynamic  programming  recursion,  as 
follows. Define a value function $V(\pi, C)$ to be the optimal value of the SM problem in (3)-(5) when the initial information
is $\pi$ and the available sensor resource level is $R = C$. The value function $V$ is thus defined on $S^N \times \mathbb{R}_+$. The SM
optimization problem is stated as a total cost problem with nonnegative costs, for which the optimal value function
satisfies Bellman's equation [3], as described below. Let $U(R) \subset U$ denote the set of sensor actions $(i, m)$ such
that $R_{im} \le R$; this is the subset of sensor actions that are feasible when there are only $R$ resources left. At each
decision  stage,  there  is  a  choice  of  stopping  and  classifying  the  objects  with  the  available  information,  or  taking 
additional  measurements.  The  optimal  value  function  therefore  satisfies  the  Bellman  equation 


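$$V(\pi, R) = \min\Big[\ \sum_{i=1}^{N} \min_{v_i \in X} \sum_{j=1}^{K} c(v_i, j)\,\pi_{ij},\ \ \min_{u = (i', m') \in U(R)} E_y\{V(T(\pi, u, y), R - R_{i'm'})\}\ \Big] \tag{16}$$

where

$$\begin{aligned}
E_y\{V(T(\pi, u, y), R - R_{i'm'})\} &= \sum_{y \in Y_{m'}} P(y \mid I(k), u)\, V(T(\pi, u, y), R - R_{i'm'}) && (17) \\
&= \sum_{y \in Y_{m'}} e^T B_{i'}(y)\,\pi_{i'}\, V(T(\pi, u, y), R - R_{i'm'}) && (18)
\end{aligned}$$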


This  recursion  defines  the  optimal  value  function  from  a  given  information  vector  and  a  given  resource  level  in 
terms of the value function at other information vectors evaluated at strictly lower resource levels. Furthermore,
we have boundary conditions for this recursion as follows: Let $R_{\min} = \min_{i,m} R_{im}$. Then, the set of admissible
modes $U(R)$ is empty for $R < R_{\min}$. Thus,

$$V(\pi, R) = \sum_{i=1}^{N} \min_{v_i \in X} \sum_{j=1}^{K} c(v_i, j)\,\pi_{ij} \quad \text{if } R < R_{\min} \tag{19}$$


Eqs.  (16)-(19)  can  be  used  recursively  to  compute  the  optimal  value  for  all  information  states  and  nonnegative 
resource  levels. 
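To make the recursion concrete, the following is a minimal sketch of (16)-(19) for very small problems. It is an illustration under stated assumptions, not the authors' implementation: resources are taken to be positive integers that depend only on the mode, beliefs are stored as tuples of tuples so they can be memoized, and the names `make_value_function`, `cost`, `lik`, and `res` are introduced here.

```python
from functools import lru_cache


def make_value_function(cost, lik, res):
    """Return V(pi, R) for a tiny SM problem.

    cost[v][j]   : c(v, j), cost of deciding class v+1 when the true class is j+1
    lik[m][j][y] : P(y | x = j+1, mode m+1)
    res[m]       : positive integer resource use of mode m+1 (assumed equal for all objects)
    """
    K = len(cost)

    def stop_cost(pi):
        # First term of (16): classify every object now, decoupled across objects.
        return sum(
            min(sum(cost[v][j] * p[j] for j in range(K)) for v in range(K))
            for p in pi
        )

    @lru_cache(maxsize=None)
    def V(pi, R):
        best = stop_cost(pi)  # also the boundary condition (19) when no action is feasible
        for i, p in enumerate(pi):
            for m, r in enumerate(res):
                if r > R:
                    continue  # action (i, m) is not in U(R)
                expected = 0.0
                for y in range(len(lik[m][0])):
                    unnorm = [lik[m][j][y] * p[j] for j in range(K)]
                    py = sum(unnorm)  # weight e^T B_i(y) pi_i in (18)
                    if py == 0.0:
                        continue
                    post = tuple(u / py for u in unnorm)
                    expected += py * V(pi[:i] + (post,) + pi[i + 1:], R - r)
                best = min(best, expected)
        return best

    return V
```

The number of reachable beliefs grows quickly with the number of objects and the resource level, which is the computational difficulty that motivates the relaxation in the next section.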

Note  that  the  initialization  of  the  recursion  decouples  into  N  independent  optimizations,  as  there  are  no  coupling 
constraints on the decisions $v_i$, and the local decision costs $c(v_i, x_i)$ depend only on the marginal probability
distributions of each object's type. However, the recursion (16) does not preserve this decomposability. The
coupling arises primarily because of the resource use constraints in (5); the decision of which object to view
and which mode to use depends on the information vector of all the objects and the available resources. Thus,
the dynamic programming induction must be carried out for the entire state $\pi(k)$, which becomes a formidable
computation  problem  even  for  moderate  numbers  of  objects. 


3  Relaxed  Formulation  and  Lower  Bounds  on  Classification  Performance

A  possible  approach  to  overcoming  the  computational  difficulty  of  the  previous  formulation  is  to  relax  the  sample 
path  sensor  resource  use  constraints  (5)  and  use  an  averaged  version  of  the  same  constraints,  as 

$$E\Big\{\sum_{k=0}^{T-1} R(u(k))\Big\} \le R \tag{20}$$

This approach replaces a large set of constraints (one per sample path) by a single aggregate constraint. Note that
any SM strategy that satisfies (5) will also satisfy (20). Thus, this approach enlarges the set of admissible SM
strategies. Let $J^*$ denote the optimal classification cost of the original SM problem in (3)-(4) with constraints (5).
Let $J^*_A$ denote the optimal classification cost of the SM problem in (3)-(4) with constraints (20). This leads to the
following lemma:




Lemma 3.1 $J^* \ge J^*_A$

The relaxed SM problem has a single coupling constraint relating the sensing actions on different objects. This
structure can be exploited using Lagrange multipliers as follows. Let $\lambda \ge 0$ denote a Lagrange multiplier. Consider
the new SM objective for admissible SM policies in $\Gamma$ as

$$J(\lambda, \gamma) = E_\gamma\Big\{\sum_{i=1}^{N} c(v_i, x_i)\Big\} + \lambda\Big[E_\gamma\Big\{\sum_{k=0}^{T-1} R(u(k))\Big\} - R\Big] \tag{21}$$


Consider now the unconstrained SM problem of finding adaptive SM strategies $\gamma$ and an adaptive stopping time
$T$ to minimize (21). If $(\gamma, T)$ is an adaptive SM policy that satisfies (20), the second term in (21) is nonpositive.
Denote by $J^*(\lambda)$ the optimal value of (21) over all adaptive SM strategies $\gamma \in \Gamma$. Then,

Lemma 3.2 For all values of $\lambda \ge 0$,

$$J^* \ge J^*_A \ge J^*(\lambda) \tag{22}$$

In particular,

$$J^* \ge \sup_{\lambda \ge 0} J^*(\lambda) \tag{23}$$
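The second inequality in (22) can be spelled out in two steps using only (4), (20) and (21). In the sketch below, $\Gamma_A$ denotes the set of policies in $\Gamma$ that satisfy the averaged constraint (20); this symbol is introduced here for illustration and is not notation from the report.

```latex
\begin{aligned}
J^*(\lambda) = \min_{\gamma \in \Gamma} J(\lambda, \gamma)
  &\le \min_{\gamma \in \Gamma_A} J(\lambda, \gamma)
  && \text{(minimizing over a subset cannot decrease the minimum)} \\
  &= \min_{\gamma \in \Gamma_A} \Big( J(\gamma)
     + \lambda \Big[ E_\gamma\Big\{ \sum_{k=0}^{T-1} R(u(k)) \Big\} - R \Big] \Big) \\
  &\le \min_{\gamma \in \Gamma_A} J(\gamma) = J^*_A
  && \text{(since } \lambda \ge 0 \text{ and the bracket is } \le 0 \text{ on } \Gamma_A\text{)}
\end{aligned}
```

The first inequality in (22) is Lemma 3.1: every policy that satisfies the sample path constraint (5) also belongs to $\Gamma_A$.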

Lemma 3.2 is a consequence of weak duality in nonlinear programming [4]. Note that the number of adaptive SM
strategies that satisfy (21) is finite, because the set of possible histories $I(k)$ is finite for all $k$. Thus, computation
of $J^*_A$ is an integer programming problem, and computation of $\sup_{\lambda \ge 0} J^*(\lambda)$ is its dual problem. The key issue is
whether the lower bounds $J^*(\lambda)$ can be computed efficiently. Rewrite (21) for $\gamma \in \Gamma$ as

$$J(\lambda, \gamma) = E_\gamma\Big\{\sum_{i=1}^{N} \Big[c(v_i, x_i) + \lambda \sum_{k=0}^{T-1} R(u(k))\,\delta(i(k) - i)\Big]\Big\} - \lambda R \tag{24}$$

where the indicator function $\delta(i) = 1$ if $i = 0$, and 0 otherwise. This suggests that optimization of $J(\lambda, \gamma)$ may be
separable across individual objects $i$.

Partition the information $I(k)$ into disjoint sets $I_i(k)$, where $I_i(k)$ contains the sensing actions and measurement
outcomes for object $i$:

$$I_i(k) = \{(u(j), y(j)) \mid j < k,\ i(j) = i\} \tag{25}$$

Note that the conditional probability vector $\pi_i$ only changes on measurements included in $I_i(k)$. We wish to restrict
the set of adaptive SM strategies to a subset where the decision to apply a sensor action for object $i$ depends only
on the information previously collected for object $i$. We refer to this subset of strategies as adaptive local SM
strategies, defined as:

Definition 3.1 An adaptive local SM policy is an adaptive SM policy $\gamma$ and stopping times $T_i$, $i = 1, \ldots, N$, with
the properties that, for each sensing action instance $k$,

1. If $u(k) = (i(k), m(k))$, then $i(k) = k \bmod N + 1$.

2. The selected sensor mode $m(k)$ depends only on the information $I_{i(k)}(k)$.

3. For each object $i$, there is a stopping time $T_i$ which depends only on $I_i(T_i)$ such that, for all $k \ge T_i$, if $i = k
\bmod N + 1$, no sensing action is taken. If $k < T_i$ and $i = k \bmod N + 1$, then $u(k) = (i, m)$ for some mode
$m$ in $\{1, \ldots, M\}$.

4. At time $T_i$, the local decision $v_i$ for object $i$ is selected as a function of $I_i(T_i)$.

Adaptive  local  SM  strategies  use  a  round-robin  schedule  for  selecting  which  objects  to  measure.  Thus,  the  choice 
of  sensing  object  for  each  action  is  not  adapted  to  the  prior  information.  Furthermore,  the  choice  of  sensing  mode 
for  each  action  on  object  i  depends  only  on  the  prior  information  collected  on  that  object.  In  addition,  there  is  an 
independent  stopping  time  for  each  object  i  such  that  a  final  classification  decision  is  made  on  object  i,  based  only 
on prior information collected on that object. Note that there are decision instances $k$ where no sensing action
is taken, when $k \ge T_i$ and $i = k \bmod N + 1$; these instances correspond to times after a final decision has been
selected for object $i$. The effective stopping time of an adaptive local SM policy is defined as $T = \max_{i=1,\ldots,N} T_i$,
and is the earliest time at which every object has a final classification decision. Thus, adaptive local SM strategies
can be viewed as a subset of the class of adaptive SM strategies.

Let $\Gamma_L$ denote the set of adaptive local SM policies. For a given amount of sensor resources $R$, there are a finite
number of feasible adaptive local SM strategies. In general, $\Gamma_L$ is a countable discrete set. For the purposes of
bound computation, we will expand $\Gamma_L$ to include mixed policies, consisting of probabilistic mixtures of policies in
$\Gamma_L$:

Definition 3.2 A mixed local SM policy is a probability distribution $q(\gamma)$ over $\Gamma_L$ such that local SM policy $\gamma$ is
selected for use with probability $q(\gamma)$. The set of mixed local SM strategies is denoted by $Q(\Gamma_L)$.

Consider the problem of minimizing the relaxed cost (24) over local SM policies $\Gamma_L$. Since $\Gamma_L \subset \Gamma$, we have

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) \le \min_{\gamma \in \Gamma_L} J(\lambda, \gamma) \tag{26}$$

Furthermore, since (24) is an unconstrained objective, the minimum over mixed local SM policies is achieved by a
pure local SM policy, so

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) \le \min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\lambda, \gamma) \tag{27}$$

The importance of mixed local SM strategies is highlighted in the theorem below.

Theorem 3.1 Consider any admissible adaptive SM policy $\gamma \in \Gamma$. Then, there exists a mixed local SM policy
$q \in Q(\Gamma_L)$ such that the expected classification costs in (4) and the expected total resource use in (20) are equal
under both policies $\gamma$ and $q$.

The proof of this result is by construction, and is included in the Appendix. This result implies the following
inequality:

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) \ge \min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\lambda, \gamma) \tag{28}$$

Combining (27) and (28) yields the following:

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) = \min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\lambda, \gamma) = \min_{\gamma \in \Gamma_L} J(\lambda, \gamma) \tag{29}$$

Eq. (29) implies that lower bounds for the achievable classification performance can be computed by optimizing
over local SM policies only. For each local SM policy $\gamma \in \Gamma_L$, let $\gamma_i$ denote the policy that is used for instances $k$
when actions are taken for object $i$, and let $\Gamma_{Li}$ be the set of such admissible local SM policies for object $i$. Thus, $\gamma_i$
selects actions for object $i$ based on past observations $I_i(k)$, and selects a stopping time $T_i$ and a final classification
$v_i$ at that stopping time. The importance of local SM policies is that the optimization in (29) decouples over
objects as

$$\begin{aligned}
\min_{\gamma \in \Gamma_L} J(\lambda, \gamma) &= \min_{\gamma \in \Gamma_L} E_\gamma\Big\{\sum_{i=1}^{N} \Big[c(v_i, x_i) + \lambda \sum_{k=0}^{T-1} R(u(k))\,\delta(i(k) - i)\Big]\Big\} - \lambda R \\
&= \sum_{i=1}^{N} \min_{\gamma_i \in \Gamma_{Li}} E_{\gamma_i}\Big[c(v_i, x_i) + \lambda \sum_{\substack{0 \le k \le T_i - 1 \\ i = k \bmod N + 1}} R(u(k))\Big] - \lambda R
\end{aligned}$$

This implies that computation of the bounds can be achieved with $N$ independent optimization problems for each
value of $\lambda$. Furthermore, the optimal bound can be computed as in Lemma 3.2, as

$$J^* \ge \sup_{\lambda \ge 0} \Big\{ \sum_{i=1}^{N} \min_{\gamma_i \in \Gamma_{Li}} E_{\gamma_i}\Big[c(v_i, x_i) + \lambda \sum_{\substack{0 \le k \le T_i - 1 \\ i = k \bmod N + 1}} R(u(k))\Big] - \lambda R \Big\} \tag{30}$$

Note that the right hand side of (30) is the dual of the following linear programming problem:

$$\min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\gamma) \tag{31}$$

subject to

$$\sum_{\gamma \in \Gamma_L} q(\gamma)\, E_\gamma\Big[\sum_{k=0}^{T-1} R(u(k))\Big] \le R \tag{32}$$

$$\sum_{\gamma \in \Gamma_L} q(\gamma) = 1 \tag{33}$$

which is a linear program over the choice of probability distributions $q \in Q(\Gamma_L)$. This can be exploited to solve
efficiently for the bound. Specifically, note that this is a linear program subject to two constraints, which implies
that the optimal mixed local SM policy $q$ will have support only on two pure local SM policies. This property will
be exploited in the next section for bound computation.
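The two-policy support property can be illustrated on a small restricted problem. The sketch below assumes a finite list of candidate pure policies, each summarized by its expected classification cost and expected resource use, and finds the best mixture by checking single policies and pairs whose mixture uses exactly the resource budget. The function name `best_mixture` and the numbers in the usage note are hypothetical; a production implementation would use an LP solver instead.

```python
def best_mixture(policies, budget):
    """Minimize the mixed cost subject to expected resource use <= budget.

    policies : list of (expected_cost, expected_resource) pairs, one per pure policy
    Returns (cost, {policy_index: weight}) or None if no mixture is feasible.
    """
    best = None
    for a, (Ja, Ra) in enumerate(policies):
        # A single pure policy that fits within the budget.
        if Ra <= budget and (best is None or Ja < best[0]):
            best = (Ja, {a: 1.0})
        for b, (Jb, Rb) in enumerate(policies):
            # Mix a low-resource policy a with a high-resource policy b so that
            # the expected resource use equals the budget exactly.
            if Ra < budget < Rb:
                w = (Rb - budget) / (Rb - Ra)  # weight on policy a
                J = w * Ja + (1.0 - w) * Jb
                if best is None or J < best[0]:
                    best = (J, {a: w, b: 1.0 - w})
    return best
```

With the hypothetical candidates `[(10, 0), (4, 4), (1, 10)]` and a budget of 5, no single feasible policy does better than cost 4, while mixing the second and third policies lowers the expected cost. This is why mixed policies matter for using the averaged resource constraint fully.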


4  Computation  of  the  Lower  Bound 

There  are  two  potential  approaches  to  compute  a  lower  bound:  a  dual  approach,  based  on  Lagrangian  relaxation 
[16], that optimizes (30) over the choice of dual variable $\lambda$, and a primal approach based on solving the linear
program (31)-(33). The dual approach is straightforward, and uses techniques from nondifferentiable optimization
[19] to search the space of possible $\lambda$. The primal approach is harder, because the optimization is over a large
space of possible values of mixture probabilities $q$. However, this mixture has very sparse support, which makes it
suitable for column generation algorithms [18].

A fundamental step in either approach is the computation of the optimal local SM strategies for a fixed value
of $\lambda$ for each object. For object $i$, one must solve the local problem given $\lambda$:

$$\min_{\gamma_i \in \Gamma_{Li}} E_{\gamma_i}\Big[c(v_i, x_i) + \lambda \sum_{\substack{0 \le k \le T_i - 1 \\ i = k \bmod N + 1}} R(u(k))\Big] \tag{34}$$

This problem is a multi-stage single object partially observed Markov decision problem, with sufficient statistic
given by the marginal probability distribution $\pi_i(k)$. Furthermore, we can reduce the action instants to a new
counter $k'$ indexing only the action opportunities for object $i$, to obtain

$$\min_{\gamma_i \in \Gamma_{Li}} E_{\gamma_i}\Big[c(v_i, x_i) + \lambda \sum_{k'=0}^{T_i' - 1} R_{i\,m(k')}\Big] \tag{35}$$

The resulting POMDP problems are small enough to solve using existing algorithms such as those overviewed in
[1, 11, 10, 13, 14]. These algorithms exploit Smallwood and Sondik's efficient parameterization [2] of the optimal
cost-to-go at stage $k'$ as a minimum of linear functions of the statistic $\pi_i(k')$, and are efficient for problems with a
few discrete true states.
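For intuition about the local problem (35), the following sketch evaluates it by direct recursion over beliefs for a single object and a fixed multiplier. It is not the Smallwood-Sondik parameterization cited above. It also assumes a cap `max_actions` on the number of sensing actions for the object, so the value it returns is exact only for that truncated problem. All names (`local_value`, `cost`, `lik`, `res`) are introduced here for illustration.

```python
def local_value(pi, lam, cost, lik, res, max_actions):
    """Optimal value of the single-object problem with priced resources.

    pi           : current marginal over the K classes of the object
    lam          : Lagrange multiplier (price per unit of sensor resource)
    cost[v][j]   : c(v, j)
    lik[m][j][y] : P(y | x = j+1, mode m+1)
    res[m]       : resource use of mode m+1 on this object
    max_actions  : assumed cap on remaining sensing actions
    """
    K = len(pi)
    # Stop now and pick the classification with the smallest expected cost.
    stop = min(sum(cost[v][j] * pi[j] for j in range(K)) for v in range(K))
    if max_actions == 0:
        return stop
    best = stop
    for m, r in enumerate(res):
        total = lam * r  # priced resource use of this sensing action
        for y in range(len(lik[m][0])):
            unnorm = [lik[m][j][y] * pi[j] for j in range(K)]
            py = sum(unnorm)
            if py > 0.0:
                post = [u / py for u in unnorm]
                total += py * local_value(post, lam, cost, lik, res, max_actions - 1)
        best = min(best, total)
    return best
```

As the multiplier grows, sensing becomes less attractive and the policy stops earlier; at a large enough price the object is classified from its prior alone.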

Solution of the $N$ decoupled problems (35) yields a local SM policy $\gamma \in \Gamma_L$, for which the expected classification
cost $E_\gamma[\sum_{i=1}^{N} c(v_i, x_i)]$ and expected resource use $E_\gamma[\sum_k R(u(k))]$ are computed from the solution. This provides
the starting point for the use of column generation [18] for solution of (31)-(33). Column generation was used by
Yost [21, 22, 23] in his work on POMDPs for resource assignment and was also exploited in [8] for the solution
of stochastic weapon assignment problems. The main result of [21, 22, 23] is an efficient constraint generation
algorithm which solves the linear program in (31)-(33) while considering only mixtures of a very small number of
local strategies. We summarize this algorithm below.

The algorithm starts with an initial set of pure local SM policies $\gamma^d$ indexed by $d = 1, \ldots, D$, with known
expected classification performance $J^d$ and expected resource use $R^d$. The first step in the algorithm is to solve
the linear program in (31)-(33) restricted to mixtures of the $d = 1, \ldots, D$ initial policies. Since the support of the
admissible mixed policies is restricted, the solution provides an upper bound $J^{UB}$ to the optimal cost. Denote by
$\lambda^D$ the optimal dual price of the resource constraint (32) in this solution. The constraint generation algorithm uses
this optimal dual price value in (35) to generate a new candidate local SM policy $\gamma^{D+1}$, solving $N$ independent
POMDP problems. The combined solution of the $N$ subproblems also provides a lower bound $J^{LB}$ on the optimal
performance, as described in Lemma 3.2. The key result in the constraint generation algorithm is stated as follows.

Lemma 4.1 Consider the pure local SM policy $\gamma^{D+1}$ generated by the solution of (35). If $J^{LB} = J^{UB}$, the optimal
solution over all mixtures of local SM policies is a mixture of the local strategies indexed by $d = 1, \ldots, D$. Otherwise,
the pure local SM policy $\gamma^{D+1}$ can be used as part of a mixed strategy which provides a cost lower than $J^{UB}$.

The proof of this result is given by Gilmore and Gomory [18]. It is based on the fact that solving the decoupled
dual problem (35) is equivalent to finding the local SM policy which has the greatest impact in reducing the cost
of the current best mixture. This leads to a dynamic column generation algorithm, as follows: if $J^{LB} < J^{UB}$,
increase the number of local SM policies considered in the LP by adding the new pure local SM policy $\gamma^{D+1}$, and
re-solve the primal problem in (31)-(33) with support restricted to $\{\gamma^1, \ldots, \gamma^{D+1}\}$. For the optimal dual value, solve
the relaxed problem in (35), and compare the new upper and lower bounds. Each iteration reduces the upper
bound, until the lower bound and upper bound estimates are close enough. By the lemma above, the optimal
solution will be obtained without enumerating all of the pure local strategies.

5  Extension  to  Multiple  Sensors 

The  development  of  the  previous  sections  carries  through  with  little  modification  when  multiple  sensors  are  used. 
The  key  difference  is  that  there  is  a  separate  resource  constraint  for  each  sensor.  Thus,  there  will  be  a  vector  of 
sensor resources $R_s$, where $s$ is a sensor index, thus resulting in a vector of averaged constraints (20). The Lagrange
multipliers $\lambda$ will thus be vectors instead of scalars. Nevertheless, all of the lemmas and theorems can be extended
to  the  multisensor  case  with  minor  modifications. 

The  main  assumption  that  was  used  in  the  single  sensor  formulation  was  that  only  one  sensor  action  would 
be  performed  simultaneously.  While  this  assumption  is  accurate  for  single  sensor  problems,  it  is  an  optimistic 
assumption  for  multiple  sensor  problems  where  time  or  duty  cycle  is  the  main  resource.  Multisensor  problems  are 
often  required  to  operate  the  sensors  simultaneously,  thereby  potentially  degrading  the  achievable  performance. 
However, note that the local SM strategies that are used in the lower bound computation allow for the parallel
execution of sensing actions on different objects, so as long as the number of objects is greater than the number
of sensors, there will not be much performance degradation from executing simultaneous sensing actions.

The column generation algorithm discussed in the previous section extends naturally to multiple sensors. When
there are $L$ sensors, the optimal mixed local SM policies will be mixtures of $L + 1$ pure local SM policies.
Nondifferentiable optimization algorithms that maximize the dual cost can also be used in this case.

6  Examples 

In  this  section,  we  present  computational  experiments  comparing  the  lower  bounds  described  in  the  previous  section 
with  the  Monte  Carlo  performance  of  a  pair  of  SM  feedback  policies. 

We  consider  scenarios  involving  a  single  sensor  with  100  unknown  objects.  The  objects  can  be  of  three  different 
types  (K  =  3),  corresponding  to  cars,  trucks  and  military  vehicles.  The  sensor  can  be  electronically  steered  to 
collect images of each object; the sensor has a low resolution mode that takes 1 second per image ($R_{i1} = 1$),
and a higher resolution mode that requires 5 seconds per image ($R_{i2} = 5$). Low resolution imagery is useful
in  separating  cars  from  trucks  and  military  vehicles,  but  separating  trucks  from  military  vehicles  requires  high 
resolution  imagery. 

In the experiments, we start with the a priori information that there are on average 10 military vehicles, 20
trucks and 70 cars in a group of 100 objects. Thus, each object has an initial probability distribution over type of
(0.1, 0.2, 0.7), where types are indexed as military vehicle, truck and car. We assume that the images generated by
the sensor are processed into binary outputs, where $y_{im} = 1$ indicates that object $i$ is estimated to be potentially a
military vehicle, and $y_{im} = 2$ indicates that object $i$ is likely not to be a military vehicle.

The objective of the problem is to determine as accurately as possible which objects are military vehicles (type
1). Thus, the classification costs are given by a $3 \times 3$ matrix where $v_i$ is the row index:

$$(c(v_i, x_i)) = \begin{pmatrix} 0 & MD & MD \\ FA & 0 & 0 \\ FA & 0 & 0 \end{pmatrix} \tag{36}$$

where $MD$ and $FA$ are variables in the experiments representing missed detection and false alarm costs. In the
experiments, $FA$ is held constant at 1, while $MD$ varies from 1 to 80, indicating the relative cost of failing to
classify correctly a military vehicle.

To complete the problem specification, we need to describe the conditional probability distribution of the
measurements and the constraints on the decisions. The conditional probability distributions $p(y \mid x, m)$ are given
by:

$$\begin{array}{ll}
p(y_1 = 1 \mid 1, 1) = 0.90; & p(y_1 = 2 \mid 1, 1) = 0.10 \\
p(y_1 = 1 \mid 2, 1) = 0.90; & p(y_1 = 2 \mid 2, 1) = 0.10 \\
p(y_1 = 1 \mid 3, 1) = 0.10; & p(y_1 = 2 \mid 3, 1) = 0.90 \\
p(y_2 = 1 \mid 1, 2) = 0.80; & p(y_2 = 2 \mid 1, 2) = 0.20 \\
p(y_2 = 1 \mid 2, 2) = 0.15; & p(y_2 = 2 \mid 2, 2) = 0.85 \\
p(y_2 = 1 \mid 3, 2) = 0.05; & p(y_2 = 2 \mid 3, 2) = 0.95
\end{array}$$

Note that mode 1 is unable to distinguish between types 1 and 2 (military vehicles vs trucks), but mode 2 can do so.
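The remark about mode 1 can be checked numerically from the conditional probabilities listed above. The sketch below encodes them with zero-based indices (an indexing choice made here, not in the report) and compares the military-to-truck odds before and after one measurement; the names `LIK`, `PRIOR`, `posterior`, and `military_to_truck_odds` are hypothetical.

```python
# LIK[m][j][y]: probability of output y+1 when mode m+1 views an object of type j+1
# (type 1 = military vehicle, type 2 = truck, type 3 = car).
LIK = [
    [[0.90, 0.10], [0.90, 0.10], [0.10, 0.90]],  # mode 1, low resolution
    [[0.80, 0.20], [0.15, 0.85], [0.05, 0.95]],  # mode 2, high resolution
]
PRIOR = [0.1, 0.2, 0.7]  # military vehicle, truck, car


def posterior(prior, mode, y):
    """Bayes update of one object's type distribution after output y (zero-based)."""
    unnorm = [LIK[mode][j][y] * prior[j] for j in range(len(prior))]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def military_to_truck_odds(p):
    """Ratio P(military) / P(truck); unchanged by a mode that cannot separate them."""
    return p[0] / p[1]
```

Under mode 1 the odds stay at the prior value of 0.5 for either output, because types 1 and 2 share the same likelihoods. Under mode 2 the odds move up after output 1 and down after output 2.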


In  terms  of  constraints,  we  assume  that  there  is  a  single  resource  pool  of  R  seconds  to  be  used  before  all  objects 
need  to  be  classified.  This  number  will  also  be  varied  across  the  experiments  from  300  seconds  to  700  seconds,  to 
evaluate  the  bounds  and  algorithm  performance  for  scenarios  where  the  amount  of  sensor  resources  ranges  from 
poor  to  rich. 

In  order  to  evaluate  the  utility  of  the  lower  bound,  we  compare  the  bound  with  the  performance  of  two  adaptive 
SM  algorithms:  a  variation  of  Kastella’s  discrimination  gain  algorithm  [5],  which  is  a  sequential  algorithm  for 
selecting  the  best  sensor  mode  and  target  on  the  basis  of  maximizing  the  expected  entropy  reduction  in  the 
distribution  of  object  type  per  unit  sensor  resource  applied,  and  a  dynamic  SM  scheduling  algorithm  based  on 
Lagrangian  relaxation  and  POMDP  approximations  described  in  [7].  The  algorithms  are  summarized  next. 

The discrimination gain algorithm of [5] starts from the sufficient statistic $\pi(k)$, consisting of the conditional
probability distribution over the type of each object after $k$ sensor actions have been taken. Associated with each object is the entropy
of this distribution,

$$H(\pi_i(k)) = -\sum_{j=1}^{K} \pi_{ij}(k) \log \pi_{ij}(k) \tag{37}$$

For each sensor mode $m$ and each object $i$ such that the available resources allow the use of that mode, the expected
entropy from using mode $m$ on object $i$ is obtained from (11) as

$$E_y\{H(\pi_i(k+1) \mid y, i, m)\} = \sum_{y \in Y_m} P(y \mid i, m)\, H(\pi_i(k+1) \mid y, i, m) \tag{38}$$

The discrimination gain algorithm computes an index for each object and sensor mode, the expected entropy gain
per unit resource, as

$$\mathrm{Gain}(i, m)(k) = \frac{H(\pi_i(k)) - E_y\{H(\pi_i(k+1) \mid y, i, m)\}}{R_{im}} \tag{39}$$


and  selects  as  its  next  sensing  action  the  object  and  mode  that  has  the  highest  Gain(i,m)(k).  Once  all  sensing 
resources  are  exhausted,  the  classification  of  each  object  is  performed  in  a  Bayes’  optimal  manner  to  minimize  the 
expected  classification  cost. 
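The index computation for one object can be sketched as follows, using the observation model and resource costs of this example (1 second for mode 1, 5 seconds for mode 2). This is an illustrative sketch of the entropy-gain index only, not the variation of the algorithm used in the experiments; the names `LIK`, `RES`, `entropy`, `expected_posterior_entropy`, and `gain` are introduced here, and natural logarithms are an assumption.

```python
import math

# LIK[m][j][y]: probability of output y+1 when mode m+1 views an object of type j+1.
LIK = [
    [[0.90, 0.10], [0.90, 0.10], [0.10, 0.90]],
    [[0.80, 0.20], [0.15, 0.85], [0.05, 0.95]],
]
RES = [1.0, 5.0]  # seconds per image for mode 1 and mode 2


def entropy(p):
    """Entropy of a discrete distribution, skipping zero-probability entries."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def expected_posterior_entropy(pi, mode):
    """Average the posterior entropy over the possible outputs of the mode."""
    total = 0.0
    for y in range(len(LIK[mode][0])):
        unnorm = [LIK[mode][j][y] * pi[j] for j in range(len(pi))]
        py = sum(unnorm)
        if py > 0.0:
            total += py * entropy([u / py for u in unnorm])
    return total


def gain(pi, mode):
    """Expected entropy reduction per unit of sensor resource."""
    return (entropy(pi) - expected_posterior_entropy(pi, mode)) / RES[mode]
```

For the prior (0.1, 0.2, 0.7), the cheap low resolution mode has the larger index, so a greedy rule of this kind would start with mode 1 on such an object. A full scheduler would evaluate `gain` for every object and every mode that still fits in the remaining resources and pick the maximizer.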

The  Lagrangian  relaxation  algorithm  of  [7]  uses  a  receding  horizon  planning  approach  based  on  a  POMDP 
algorithm  that  is  similar  to  that  used  for  computation  of  the  lower  bound,  with  the  additional  restriction  that  a 
maximum  of  three  actions  per  object  are  considered.  Since  there  is  a  resource  constraint,  the  algorithm  performs 
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a  simple  line  search  to  vary  a  Lagrange  multiplier.  For  each  value  of  the  Lagrange  multiplier,  a  local  SM  policy 
is  computed  from  (24)  that  uses  a  maximum  of  three  sensing  actions  per  object,  using  a  variation  of  the  Witness 
algorithm  [12,  13].  The  algorithm  selects  the  best  local  SM  policy  found  this  way  that  satisfies  the  expected 
resource  use  bound  in  (20).  This  policy  is  likely  to  be  conservative  with  respect  to  the  use  of  resources,  because  to 
fully  utilize  the  available  resources  requires  the  use  of  mixed  local  SM  policies.  A  local  SM  policy  can  be  viewed 
as  a  decision  tree  for  each  object,  as  in  [7].  The  initial  action  in  such  decision  trees  is  deterministic,  based  on 
the  current  knowledge,  and  the  future  actions  are  contingent  on  the  measured  values.  The  Lagrangian  relaxation 
algorithm  computes  the  local  SM  policy,  and  schedules  the  initial  sensing  actions  for  each  object.  Once  all  the 
initial  sensing  actions  are  observed,  the  probability  state  is  updated,  the  resource  state  is  decremented,  and  the 
problem is solved again from the new probability state $\pi$ and the remaining resource level. This process continues
until  no  sensor  resources  remain  to  take  additional  sensing  actions,  at  which  time  all  of  the  objects  are  classified 
for  minimum  discrimination  cost  based  on  the  available  information. 
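
The line search over the Lagrange multiplier can be sketched on a toy separable problem. In the Python sketch below, each object has a short list of candidate local policies summarized by a hypothetical pair (expected cost, expected resource use); in the algorithm of [7] these values would come from solving a per-object POMDP, which is not reproduced here. The bisection scheme, the function names, and the upper limit on the multiplier are assumptions made for illustration.

```python
def best_response(options, lam):
    """Per-object minimizer of expected cost + lam * expected resource use."""
    return min(options, key=lambda o: o[0] + lam * o[1])


def line_search(objects, budget, lam_hi=100.0, iters=50):
    """Bisect on the multiplier; return (lam, choices) for the smallest
    multiplier found whose pure (unmixed) policy meets the budget.

    Returns None if no tested multiplier up to lam_hi is feasible.
    """
    lam_lo = 0.0
    choice = [best_response(opts, lam_lo) for opts in objects]
    if sum(r for _, r in choice) <= budget:
        return lam_lo, choice  # constraint inactive, no price on resources
    feasible = None
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        choice = [best_response(opts, lam) for opts in objects]
        if sum(r for _, r in choice) <= budget:
            feasible, lam_hi = (lam, choice), lam  # feasible: try a lower price
        else:
            lam_lo = lam  # over budget: raise the price of resources
    return feasible
```

Because only pure policies are compared, the selected policy can use less than the full budget when resource use jumps past the budget as the multiplier decreases, which is the conservatism noted above.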

Each  algorithm  was  simulated  for  100  independent  Monte  Carlo  runs  using  the  same  measurement  outcomes  to 
evaluate  its  average  performance  for  three  different  levels  of  sensor  resources:  300  seconds,  500  seconds  and  700 
seconds.  Figure  1  shows  the  results  for  the  two  algorithms  and  the  lower  bound  for  300  seconds  for  a  range  of 
values  of  MD  from  1  to  80.  Figure  2  shows  similar  results  for  500  seconds,  and  Figure  3  shows  the  results  for  700 
seconds.  The  results  indicate  that  neither  algorithm  consistently  performs  close  to  the  lower  bound,  but  there  are 
conditions  where  the  performance  of  the  algorithms  and  the  lower  bounds  are  close.  For  instance,  when  MD  is 
close to 1, the costs of missed detections and false alarms are close, and policies such as maximizing information
gain  as  measured  by  entropy  are  near-optimal.  Similarly,  the  performance  of  the  Lagrangian  Relaxation  algorithm 
is  closer  to  the  lower  bound  for  limited  sensor  resources,  as  the  limited  lookahead  approximation  is  closer  to  the 
actual  optimal  number  of  sensor  actions  per  object.  However,  there  is  significant  room  for  improvement  in  both 
policies:  the  discrimination  gain  algorithm  fails  to  incorporate  the  relative  values  of  different  types  of  errors  in  its 
information  seeking  strategy,  and  the  Lagrangian  relaxation  is  conservative  in  that  it  does  not  use  mixed  strategies, 
and  thus  can  underutilize  sensor  resources. 


Figure  1:  Monte  Carlo  performance  of  algorithms  and  lower  bound  for  300  seconds  of  sensor  resource. 

It  is  possible  to  construct  curves  similar  to  a  receiver  operating  characteristic  (ROC)  by  varying  the  value  of 
MD.  Such  curves  can  characterize  the  potential  tradeoffs  in  system  performance  achieved  by  different  algorithms 
for  a  fixed  amount  of  sensor  resources.  Figures  4,  5  and  6  illustrate  the  resulting  ROC  curves  for  300,  500  and  700 
seconds  of  sensor  resources  for  the  two  algorithms.  Note  that  the  performance  of  the  two  algorithms  is  closer  than 
the  optimal  values  of  Figs.  1-3  imply. 


7  Discussion 

In this paper, we have presented a mathematical formulation for adaptive multisensor management in problems
of object classification as a partially observed Markovian decision problem. We developed an exact stochastic
dynamic programming algorithm for solution of these problems. However, the combinatorial nature of the decision
space when multiple objects are present makes the computations prohibitive even for small time horizons. We
developed an approximate formulation that provides a lower bound on the achievable performance for such sensor
management problems. This lower bound is obtained by expanding the space of admissible SM policies, replacing
a sample path resource utilization constraint by an expected resource use constraint.

Figure 2: Monte Carlo performance of algorithms and lower bound for 500 seconds of sensor resource.

Figure 3: Monte Carlo performance of algorithms and lower bound for 700 seconds of sensor resource.

Figure 4: ROC of algorithms for 300 seconds of sensor resource.

Figure 5: ROC of algorithms for 500 seconds of sensor resource.

Figure 6: ROC of algorithms for 700 seconds of sensor resource.

The resulting lower bound formulation is an integer programming problem that has a simple, separable dual
formulation.  A  key  result  in  establishing  this  separability  is  to  show  that  the  lower  bound  formulation  can  be 
solved  in  terms  of  a  subset  of  SM  policies  known  as  mixed  local  SM  policies,  which  are  random  mixtures  of  policies 
that  select  actions  on  each  object  based  only  on  the  past  information  collected  on  that  object.  This  results  in 
a  hierarchical  algorithm  for  computing  the  lower  bound,  where  dual  variables  are  selected  that  decouple  the  SM 
problem  into  independent  subproblems  for  each  object.  Each  of  the  independent  subproblems  can  be  solved  as  a 
low-dimension  partially  observed  Markov  decision  problem.  The  solutions  of  these  independent  subproblems  are 
then  used  to  improve  the  dual  variables,  until  an  optimal  lower  bound  is  obtained. 

We  presented  experimental  results  that  compared  the  lower  bound  with  the  performance  of  two  suboptimal  SM 
algorithms available in the literature. The experimental results established that the performance of both algorithms
would need to improve substantially in order to approach the lower bound.

The  lower  bound  developed  in  this  paper  can  be  used  as  a  reference  solution  for  the  development  of  effective  SM 
algorithms.  Furthermore,  the  approximation  used  in  developing  the  lower  bound  can  be  used  in  SM  algorithms 
that  attempt  to  optimize  this  lower  bound,  in  order  to  generate  practical  real-time  algorithms  whose  performance 
approaches this lower bound. This requires embedding the solution algorithms for lower bounds into a real-time
algorithm such as model-predictive control, and developing a scheduling algorithm for determining when to
recompute the SM policies using a receding horizon approach. Development of real-time SM algorithms with
performance  that  approaches  the  lower  bound  remains  a  challenge  for  future  research. 

8 Appendix

Outline  of  proof  of  Theorem  3.1,  to  be  expanded  later 

Given an SM policy $\gamma$, construct a local behavior policy $\eta_i$ for each object i, which uses randomized decisions
at each decision time, such that the marginal distribution of the decisions for object i is the same under $\gamma$ and
$\eta_i$, as follows:

Using policy $\gamma$, compute the marginal probability distribution of the first sensor action made on object i;
note that this may include making no sensor action at all. Use this probability distribution as the probability
distribution for selecting the first sensing decision in policy $\eta_i$. For each possible measurement value, compute
the probability distribution of the next sensor action under policy $\gamma$ on object i. Use this probability distribution
as the distribution for selecting the second sensor action on object i, conditioned on the measurement obtained
from the first action. Repeat this process until there are no sample paths generated by policy $\gamma$ with subsequent
actions on object i. For the final classification decision on object i, compute the probability distribution of the final
decision after the final sensing outcome on object i, aggregating over the sample paths generated by $\gamma$.

By construction, the marginal probability of sensor actions on object i is the same under $\gamma$ and $\eta_i$. Repeating
this construction for all objects i, one obtains a set of local SM random policies $\eta = \{\eta_1, \ldots, \eta_N\}$ that obtain
the same expected classification performance and the same expected resource use as policy $\gamma$. From this random
policy $\eta$, one can construct a mixed local SM policy with the same property, using the standard construction for
generating mixed policies from random behavior policies in decision problems with perfect recall.


References 

[1] G. E. Monahan, “A survey of partially observable Markov decision processes: Theory, models and algorithms,”
Mgmt. Sci., V. 28, pp. 1-16, Jan. 1982.

[2] R. D. Smallwood and E. J. Sondik, “The optimal control of partially observable Markov processes over a finite
horizon,” Op. Res., V. 21, pp. 1071-1088, 1973.

[3] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vols. I-II, Athena Scientific, Belmont, MA, 1995.

[4] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1999.

[5] K. Kastella, “Discrimination Gain to Optimize Detection and Classification,” IEEE Trans. on Systems, Man
and Cybernetics, Part A, V. 27, No. 1, Jan. 1997.

[6]  D.  A.  Castanon,  “Optimal  search  strategies  for  dynamic  hypothesis  testing,”  IEEE  Trans.  Sys.,  Man  & 
Cybernetics,  v.  25,  1995. 

[7]  D.  A.  Castanon,  “Approximate  Dynamic  Programming  for  Sensor  Management,”  Proc.  36th  IEEE  Conference 
on  Decision  and  Control,  San  Diego,  CA,  December  1997. 

[8]  D.  A.  Castanon  and  J.  M  Wohletz,  “Model  Predictive  Control  for  Unreliable  Dynamic  Task  Assignment,” 
Proc.  2002  Conf.  Decision  and  Control,  Las  Vegas,  NV,  Dec.  2002. 

[9]  G.  Cohen,  “Auxiliary  problem  principle  and  decomposition  of  optimization  problems,”  J.  Opt.  Theory  and 
Appl,  V.  32,  1980. 

[10]  W.  S.  Lovejoy,  “A  survey  of  algorithmic  methods  for  partially  observable  Markov  decision  processes,”  Annals 
of  Operations  Research,  v.  28,  1991. 

[11]  C.  C.  White,  “Partially  observed  Markov  decision  processes:  A  Survey,”  Ann.  of  Op.  Res.,  V.  32,  1991 

[12]  M.  L.  Littman,  A.  R.  Cassandra  and  L.  Pack-Kaelbling,  “Efficient  dynamic  programming  updates  in  partially 
observable  Markov  decision  processes,”  working  paper,  Brown  University,  Dec.  1995. 

[13]  A.  R.  Cassandra,  Exact  and  Approximate  Algorithms  for  Markov  Decision  Processes,  Ph.  D.  Dissertation, 
Brown  University,  Providence,  RI  1998. 




[14] A. R. Cassandra, M. L. Littman and N. L. Zhang, “Incremental Pruning: A Simple, Fast Exact Method
for Partially Observed Markov Decision Processes,” Proc. 13th Conf. Uncertainty in Artificial Intelligence,
Providence, RI, 1997.

[15]  M.  R.  Garey  and  D.S.  Johnson, Computers  and  Intractability:  A  Guide  to  the  Theory  of  NP-Completeness, 
W.H.  Freeman,  New  York,  1979. 

[16]  A.  M.  Geoffrion,  “Lagrangian  relaxation  for  integer  programming,”  Math.  Prog.  Studies,  v.  2,  1974. 

[17] V. Krishnamurthy and R. J. Evans, “Hidden Markov Model Multiarm Bandits: A Methodology for Beam
Scheduling  in  Multitarget  Tracking,”  IEEE  Trans.  Signal  Processing ,  V.  49,  N.  12,  Dec.  2001. 

[18]  P.  C.  Gilmore  and  R.  E.  Gomory,  “A  Linear  Programming  Approach  to  the  Cutting  Stock  Problem”  Operations 
Research,  V.  9,  1961. 

[19]  V.  M.  Demyanov  and  L.  V.  Vasilev,  Nondifferentiable  Optimization,  Optim.  Software,  New  York  1985. 

[20]  S.  J.  Benkoski,  M.  G.  Monticino,  and  J.  R.  Weisinger,  “A  Survey  of  the  Search  Theory  Literature,”  Naval 
Research  Logistics,  Vol.  38,  No.  4,  1991,  pp.  469-494. 

[21]  K.  A.  Yost,  Solution  of  Large-Scale  Allocation  Problems  with  Partially  Observed  Outcomes,  Ph.  D.  Thesis, 
Naval  Postgraduate  School,  Monterey,  CA,  Sept.  1998. 

[22]  K.  A.  Yost  and  A.  R.  Washburn,  “The  LP/POMDP  Marriage:  Optimization  with  Imperfect  Information,” 
Naval  Research  Logistics,  Vol  47,  No.  8,  607-619,  2000. 

[23]  K.  A.  Yost  and  A.  R.  Washburn,  “Optimizing  Assignments  of  Air-to-Ground  Assets  and  BDA  Sensors,” 
Military  Operations  Research,  Vol.  5,  No.  2,  77-91,  2000. 

[24] R. Chen and G. L. Blankenship, “Dynamic Programming Equations for Discounted Constrained Stochastic
Control,” IEEE Trans. Automatic Control, v. 49, no. 5, May 2004.




Closing  the  Loop  in  Sensor  Fusion  Systems: 
Stochastic  Dynamic  Programming  Approaches 


Michael K. Schneider, Member, IEEE, Gregory L. Mealy, Member, IEEE, and Felipe M. Pait, Senior Member, IEEE


Abstract — This  paper  provides  an  overview  of  the  problem 
of  managing  sensor  resources  in  a  closed-loop  sensor  fusion 
system.  We  formulate  the  problem  in  a  stochastic  dynamic 
programming  framework.  In  so  doing,  we  expose  structure  in 
the  problem  resulting  from  target  dynamics  being 
independent  and  discuss  how  this  can  be  exploited  in  solution 
strategies.  We  illustrate  situations  in  which  we  believe  such 
sensor  management  techniques  are  especially  beneficial  with 
two  examples.  One  example  is  the  management  of  a  single 
sensor,  and  the  other  is  the  management  of  multiple  sensors. 
The  focus  of  both  examples  is  on  air-to-ground  tracking. 

I.  Introduction 

In this paper, we address control aspects of sensor
fusion.  For  the  sensor  fusion  problem  of  interest  here, 
one  would  like  to  infer  the  state  of  multiple  targets  from 
measurements  made  by  one  or  more  sensors  over  time. 
Targets  are  typically  located  on  the  ground  and  can  include 
vehicles,  buildings,  and  other  man-made  objects.  States  of 
interest  could  include  position,  velocity,  mode  (e.g.  on-  or 
off-road),  vehicle  type,  etc.  Estimates  of  the  states  are 
inferred  by  fusing  information  from  multiple  sensors  over 
time.  The  fusion  engine  responsible  for  piecing  together 
information  from  different  types  of  sensors  will  typically 
create  hypotheses  by  associating  new  observations  with 
previously  detected  targets.  Alternative  hypotheses  are 
formulated  to  deal  with  ambiguities  caused  by  incomplete 
or  even  contradictory  information.  New  hypotheses  are 
created  and  abandoned  as  data  is  accumulated  that  indicates 
the  current  target  states  have  changed  or  resolves 
ambiguities  in  the  past  states  of  targets.  The  data  can  be 
generated  by  many  different  types  of  sensors,  including 
airborne  surveillance  radars,  video  sensors,  etc.  The  sensors 
are  managed  to  collect  the  appropriate  measurements.  We 
view  sensor  resource  management  (SRM)  as  the  control 
problem  of  allocating  available  sensor  resources  to  obtain 
the  best  awareness  of  the  situation. 

This  material  is  based  upon  work  supported  in  part  by  the  U.S.  Air 
Force under Contract Nos. F33615-02-C-1197, F33615-03-M-1515, and
F3365-02-C-1129.

The authors are with ALPHATECH, Inc., Burlington, MA 01803 USA.
(phone: 781-273-3388; fax: 781-273-9345; e-mail: {michaels, gmealy,
fpait}@ALPHATECH.com).


Efficient  sensor  management  requires  consideration  of 
the  value  of  particular  pieces  of  information  to  the  fusion 
engine  at  each  moment,  so  the  plant  to  be  controlled 
comprises  not  only  the  sensors  and  communication 
systems,  but  also  the  fusion  engine  that  processes  the 
information  collected  by  them,  as  illustrated  in  Fig.  1.  The 
plant’s  inputs  are  precisely  the  requests  that  the  sensor 
management  system  is  allowed  to  make,  and  its  outputs 
include  all  the  information  obtained  from  the  sensors.  The 
state  of  the  plant  is  then  the  total  information  available  to 
the  fusion  engine,  and  in  principle  also  to  the  SRM 
controller,  at  a  given  time.  The  dimension  of  the  state  is  not 
fixed:  it  increases  as  information  is  collected,  and  new 
tracks  are  initiated.  It  also  decreases  when  new  information 
results  in  hypotheses  being  resolved,  and  when  the 
hypothesis  tree  is  pruned  of  alternatives  that  are  considered 
less  likely. 

From  this  point  of  view  the  process  model  is  completely 
deterministic,  and  full  information  about  the  process  is 
available.  Uncertainty  enters  the  picture  in  the  form  of  the 
actual  measurements  obtained  by  the  sensors,  which  can  be 
treated  as  external  disturbances  about  which  we,  as 
designers  of  a  sensor  management  and  fusion  system,  have 
no  control  or  previous  knowledge.  Additional  disturbances 
include  sensor  actions  over  which  the  system  has  no  control 
-  for  example,  sensor  systems  which  are  allocated  at  a 
higher  command  level.  Indeed,  the  current  state  of  the 
fusion  system  represents  the  best  possible  guess  about  the 
actual  ground  truth  -  taking  into  account  the  information 
available and our capacity to process it. Since the estimate
does not depend on probabilities of obtaining specific data
in the future, the system is essentially causal, a fact that
simplifies conceptually the design of a sensor management
algorithm. Of course the variable dimensionality of the
state space precludes the use of textbook control design
techniques, which are not likely to be applicable in any
event.

Fig. 1. Sensor Resource Management (SRM) closes the sensor/fusion control loop.

A  number  of  different  approaches  to  the  design  of  sensor 
managers  have  been  proposed  in  the  literature.  They  cover 
the  different  aspects  of  the  sensor  management  problem 
including  how  to  manage  sensors  to  support  detecting  and 
localizing  [3],  [7],  [8],  [9];  tracking  [2],  [8],  [10],  [11], 
[12];  and  classifying  [4],  [5],  [6]  targets.  The  proposed 
solutions  include  policies  based  on  information-theoretic 
optimization  criteria  [8],  [11]  as  well  as  policies  for 
optimizing  more  traditional  criteria  (e.g.,  track  error) 
generated  using  stochastic  optimization  techniques  such  as 
index rules [2], [5], [12]; Lagrangian relaxation [6]; and others
[3], [4], [7], [9], [10]. In this paper, we give an overview of some of
the  technical  issues  in  sensor  management  including 
structure  in  the  problem  that  we  believe  can  be  exploited 
when  designing  solution  techniques.  This  is  discussed  in  a 
stochastic  dynamic  programming  framework  in  Section  II. 
In  Section  III,  we  illustrate  situations  in  which  we  believe 
sophisticated  sensor  management  strategies  are  especially 
beneficial  with  two  examples.  One  example  is  the 
management  of  a  single  sensor,  and  the  other  is  the 
management  of  multiple  sensors.  The  focus  of  both 
examples  is  on  air-to-ground  tracking. 

II.  Approximate  Stochastic  Dynamic  Programming 
Approach 

We have developed designs for the sensor management
control problem in the framework of stochastic dynamic
programming. A typical formulation starts with the system
state at time t, x(t). The state includes the true positions
and types of all targets. A control at time t, u(t), specifies a
measurement  of  the  system  to  be  taken.  The  measurement 
may  be  corrupted  by  a  stochastic  disturbance  v(t)  and  may 
be  delayed  so  that  it  is  not  realized  until  a  later  time.  The 
measurement  process  is  given  by  the  function  h,  so  that 

$$y(t_y) = h(x(t), u(t), v(t)) \qquad (1)$$

is the measurement realized at time $t_y > t$. The information
about the system at time t is summarized in the information
state I(t), consisting of all past measurements and controls

$$I(t) = \{y(t_y) : t_y < t\} \cup \{u(t_u) : t_u < t\}. \qquad (2)$$

The delay in realizing the measurement, $\Delta_y$, taken at time t,
is a function of the information state, control, and
stochastic disturbance at time t so that

$$t_y = t + \Delta_y(I(t), u(t), v(t)). \qquad (3)$$

Control decisions occur at discrete instants in time, $t_{u,0}$,
$t_{u,1}$, $t_{u,2}$, .... Following time $t_{u,i}$, the next control is executed
after the delay of $\Delta_u$, which is a function of the information
state, control, and stochastic disturbance at time $t_{u,i}$. Thus,

$$t_{u,i+1} = t_{u,i} + \Delta_u(I(t_{u,i}), u(t_{u,i}), v(t_{u,i})). \qquad (4)$$

The control is chosen from a constraint set U(I(t))
according to a control law, $\mu$, which is a function of the
information state and time. Thus,

$$u(t_{u,i}) = \mu(I(t_{u,i}), t_{u,i}). \qquad (5)$$

The sensor management policy is the collection of these
control laws

$$\pi = \{\mu(I(t), t)\}. \qquad (6)$$

Rewards are achieved upon executing the policy by
attaining particular information states. The reward for
attaining information state I(t) is given by R(I(t)). These
rewards are discounted by the factor $e^{-\gamma t}$ and integrated
across time to yield an expected reward for executing
policy $\pi$ from the information state I(0) of

$$J_\pi(I(0)) = E \int_0^\infty e^{-\gamma\tau} R(I(\tau))\, d\tau. \qquad (7)$$

The optimal sensor management policy $\pi^*$ is the one that
maximizes (7) over all policies $\pi$. The optimal policy can
be characterized in terms of Bellman’s equation [1]. In this
context, the equation states that the expected reward for the
optimal policy satisfies

$$J^*(I(t)) = \max_{u(t) \in U(I(t))} E\left[ \int_t^{t+\Delta_u(I(t),u(t),v(t))} e^{-\gamma\tau} R(I(\tau))\, d\tau + J^*\big(I(t+\Delta_u(I(t),u(t),v(t)))\big) \right]. \qquad (8)$$


The  first  term  on  the  right-hand  side  is  the  reward  accrued 
until  the  next  decision  time  after  t.  The  second  term  is  the 
expected  reward  after  that  time  accrued  from  the  resulting 
information  state.  The  policy 

$$\pi^* = \{\mu^*(I(t), t)\} \qquad (9)$$

is optimal provided that the argument of the maximum in
(8) is given by $\mu^*(I(t), t)$ for all I(t) and t (the assumption
here  is  that  the  set  of  candidate  controls  is  compact,  if  not 
finite,  so  that  the  maximum  is  well-defined).  Several 
computational  techniques,  including  both  policy  and  value 
iteration,  exploit  the  characterization  in  (8)  to  compute 
policies.  The  difficulties  in  exploiting  this  characterization 
are  tied  to  the  size  of  the  state  space,  the  set  of  candidate 
controls,  and  the  set  of  stochastic  disturbances.  In 
particular, Bellman’s equation characterizes $J^*$ for all
possible  information  states  I(t)  by  evaluating  the  right-hand 
side  of  (8)  for  all  possible  controls  u(t),  taking  an 
expectation  over  all  disturbances.  This  can  be  difficult  to 
apply  when  the  size  of  the  sets  involved  is  large. 
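
To make the computational burden concrete, the following Python sketch runs value iteration on a small finite, discrete-time discounted problem. It is a deliberately simplified stand-in for (8), not the continuous-time formulation above: the states stand in for information states, the per-step discount `beta` plays the role of the exponential discount over one decision interval, and the tables `P` and `R` are hypothetical inputs. Each sweep touches every state, every control, and every successor, which is the enumeration that becomes impractical for large sets.

```python
def value_iteration(P, R, beta, tol=1e-9, max_iter=10_000):
    """Value iteration for a finite discounted problem.

    P[a][s][s2] is the assumed probability of moving from s to s2 under
    control a, R[s] is the reward for occupying state s, and beta in (0, 1)
    is the per-step discount factor.
    """
    n, n_controls = len(R), len(P)

    def expected_next(J, s, a):
        # Expectation of the reward-to-go over the successor states
        return sum(P[a][s][s2] * J[s2] for s2 in range(n))

    J = [0.0] * n
    for _ in range(max_iter):
        J_new = [
            R[s] + beta * max(expected_next(J, s, a) for a in range(n_controls))
            for s in range(n)
        ]
        converged = max(abs(x - y) for x, y in zip(J, J_new)) < tol
        J = J_new
        if converged:
            break
    # Greedy control law with respect to the converged values
    policy = [
        max(range(n_controls), key=lambda a: expected_next(J, s, a))
        for s in range(n)
    ]
    return J, policy
```

For example, with two states, a "stay" control and a "switch" control, and a reward only in the second state, the greedy control law switches out of the first state and stays in the second.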

However,  there  is  special  structure  that  can  be  exploited. 
Consider  the  following  special  case  in  which  the  system 
state  is  the  aggregate  state  of  n  targets 

$$x(t) = \{x_1(t), \ldots, x_n(t)\} \qquad (10)$$

whose individual states $x_i(t)$ are independent and evolving
in time as Markov processes. This would be the case, for
example, when tracking independent, isolated targets.
Moreover, suppose the measurements of the system state
are conditionally independent given target state and sensor
controls so that one can write

$$y(t) = \{y_i(t) : i = 1, \ldots, n\} \qquad (11)$$

where an individual measurement can be written

$$y_i(t) = h_i(x_i(t), u(t), v_i(t)) \qquad (12)$$

for independent stochastic disturbances $v_i(t)$. Independence
introduces  considerable  structure;  however,  the  problem  is 
still  complex  since  the  information  states  of  the  system  do 
not  have  similar  independence  properties.  For  example,  one 
can  consider  partitioning  the  information  state  as 

$$I(t) = I_1(t) \cup I_2(t) \cup \cdots \cup I_n(t) \cup I_u(t) \qquad (13)$$

where

$$I_j(t) = \{y_j(t_y) : t_y < t\} \qquad (14)$$

and

$$I_u(t) = \{u(t_u) : t_u < t\}. \qquad (15)$$

However, the future information states $I_j(\tau)$ for $\tau > t$ are
neither  independent  nor  conditionally  independent  given 
the  current  control  u(t)  and  system  state  x(t).  The  reason  is 
that  the  information  states  of  targets  are  coupled  through 
the  control  decisions.  Thus,  one  cannot  rely  on  methods  for 
computing  sensor  management  policies  that  require  the 
independence  of  the  targets’  information  states. 

One  approach  we  have  used  to  develop  sensor 
management  policies  that  exploit  the  special  structure  is  the 
application  of  index  rules  [1],  [13].  Index  rules  are  optimal 
for  the  following  type  of  sensor  management  problem. 
There  are  n  targets,  whose  states  are  independent.  A 
measurement  can  be  made  of  only  one  target  at  a  time,  and 
the measurement is of fixed duration, i.e. $\Delta_y$ and $\Delta_u$ are
constants and $\Delta_y < \Delta_u$. The state of the target can only
change  at  instants  when  a  measurement  is  made  of  it  (e.g. 
the  target  state  may  not  be  changing,  but  the  information 
state  of  the  target  may  be  as  more  measurements  are 
acquired).  In  addition,  the  mission  must  be  formulated  such 
that  the  reward  R(I(t))  accrued  in  a  particular  information 
state at time t depends only on the information state $I_j(t)$ of
the  target  j  being  measured  at  that  time.  In  this  case,  the 
optimal  policy  determining  the  next  target  at  which  to  look 
from  information  state  I(t)  is  given  by  an  index  rule,  which 
has  the  form 

$$\mu(I(t)) = \arg\max_{j \in \{1, \ldots, n\}} m_j(I_j(t)) \qquad (16)$$

where $m_j(I_j(t))$ is the index of the target. The index for
target j can be represented in terms of a single target
problem. We have been able to develop solutions to these
single target problems and apply the resulting index rule
policy.  Although  the  assumptions  required  for  the  index 
rule  to  be  optimal  are  often  violated  in  sensor  management 
problems  (e.g.  one  may  be  able  to  measure  the  state  of 
more  than  one  target  at  a  time),  we  have  found  that  index 
rules  may  still  be  optimal  or,  at  least,  applicable  as  part  of 
heuristics  [5],  [14]. 
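
The form of the index rule in (16) can be sketched in a few lines of Python. The index function used here is only a placeholder chosen for illustration: each target's information state is reduced to a scalar position variance, and the index is the variance reduction from one measurement with a hypothetical noise variance. It is not the index of [13], which requires solving a single target problem.

```python
def index_rule(info_states, index_fn):
    """Return the target j that maximizes the per-target index m_j(I_j)."""
    return max(range(len(info_states)), key=lambda j: index_fn(info_states[j]))


def variance_reduction_index(variance, noise_variance=1.0):
    """Placeholder index: drop in variance after one scalar Kalman update."""
    posterior = variance * noise_variance / (variance + noise_variance)
    return variance - posterior
```

The point of the structure is that each index depends only on its own target's information state, so the cost of the decision grows linearly with the number of targets.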

Another  approach  we  have  used  to  develop  sensor 
management  policies  is  to  use  limited  lookahead  algorithms 
[1].  A  limited  lookahead  policy  is  one  for  which  the  control 
action  is  chosen  as  the  solution  to 


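$$\max_{u(t) \in U(I(t))} E\left[ \int_t^{t+\Delta_u(I(t),u(t),v(t))} e^{-\gamma\tau} R(I(\tau))\, d\tau + \tilde{J}_1\big(I(t+\Delta_u(I(t),u(t),v(t)))\big) \right] \qquad (17)$$

where

$$\tilde{J}_k(I(t)) = \max_{u(t) \in U(I(t))} E\left[ \int_t^{t+\Delta_u(I(t),u(t),v(t))} e^{-\gamma\tau} R(I(\tau))\, d\tau + \tilde{J}_{k+1}\big(I(t+\Delta_u(I(t),u(t),v(t)))\big) \right] \qquad (18)$$

for k = 1, ..., N-1, where N is the number of steps of
lookahead and the terminal reward $\tilde{J}_N$ is chosen to
approximate the expected reward. The algorithm for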
computing  the  limited  lookahead  policy  is  effectively 
enumerating  possible  controls  and  outcomes  over  N  steps, 
calculating  a  reward  for  the  resultant  state  based  on  an 
approximation,  and  selecting  the  control  that  yields  the  best 
outcome. Structure in the problem can be exploited in the
construction of $\tilde{J}_N$. As noted previously, the problem has
special structure in that individual target state evolutions
are often independent. One approach to exploiting this is to
use an approximate terminal reward that is separable so that

$$\tilde{J}_N = \sum_{j=1}^{n} \tilde{J}_{N,j}(I_j(t) \cup I_u(t)) \qquad (19)$$

for per-target rewards $\tilde{J}_{N,j}$. These can be constructed in a
number of different ways. One technique we have used is
to calculate the expected rewards associated with a single-target
form of the problem, motivated by the index
definition in [13]. Essentially, we use a function of the
index $m_j$ as the approximation $\tilde{J}_{N,j}$. We have also
explored other methods for constructing $\tilde{J}_N$ including
rollout and heuristic methods. In each case, we have tried
to exploit structure in the problem such as the existence of
independent target evolutions.
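
The enumeration behind (17)-(19) can be sketched as a short recursion. The Python sketch below is a discrete-step, undiscounted simplification with a made-up model, so every name and constant in it is an assumption: the information state is a tuple of per-target variances, a control selects one target, a detection (with hypothetical probability `P_DETECT`) halves that target's variance, and every variance first grows by `Q_GROWTH`. The terminal reward is a sum of per-target terms, which mirrors the separable form of (19).

```python
Q_GROWTH = 1.0   # hypothetical per-step variance growth
P_DETECT = 0.8   # hypothetical detection probability


def controls(state):
    """Candidate controls: the index of the target to look at."""
    return range(len(state))


def outcomes(state, u):
    """List of (probability, next state, step reward) for control u."""
    grown = tuple(v + Q_GROWTH for v in state)
    hit = tuple(0.5 * v if j == u else v for j, v in enumerate(grown))
    return [(P_DETECT, hit, -sum(hit)), (1.0 - P_DETECT, grown, -sum(grown))]


def terminal(state):
    """Separable terminal approximation: one term per target."""
    return sum(-v for v in state)


def lookahead_control(state, n_steps):
    """Choose the control that maximizes the n_steps lookahead value."""
    def value(s, k):
        if k == n_steps:
            return terminal(s)  # terminal approximation at the horizon
        return max(q(s, u, k) for u in controls(s))

    def q(s, u, k):
        # Expected step reward plus value of the resulting state
        return sum(p * (r + value(nxt, k + 1)) for p, nxt, r in outcomes(s, u))

    return max(controls(state), key=lambda u: q(state, u, 0))
```

The number of evaluated branches grows exponentially with the number of lookahead steps, which is why the quality of the terminal approximation matters more than a long horizon in practice.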


III.  Application  Examples 

What  follows  are  two  examples  of  how  we  have  been 
applying  these  techniques  to  sensor  management  problems. 
The examples outline how we have applied the stochastic
control  techniques  described  above  to  the  development  of 
sensor  managers  and  illustrate  areas  where  we  have  found 
distinct  advantages  to  using  these  techniques.  In  order  to 
illustrate  the  breadth  of  applicability,  the  examples  are 
drawn  from  two  different  types  of  problems.  The  first  is  the 
control  of  a  single  sensor;  the  second  is  the  control  of 
multiple,  distributed  sensors. 

A.  Control  of  a  Single  Sensor 

In  this  first  example,  consider  managing  a  single  sensor 
air-to-ground  radar  tracking  system.  The  sensor,  tracker, 
and  sensor  manager  are  all  colocated  on  the  sensing 
platform.  As  a  result,  the  latencies  in  transmitting 
information  between  components  are  minimal;  so,  the 
sensor  manager  is  generating  sensor  controls  on  a  fast  time 
scale.  In  this  context,  two  scenarios  in  which  a  stochastic 
control  approach  to  sensor  management  has  advantages  are 
when  there  are  differentiated  targets  and  when  the  sensor 
mode  must  be  matched  to  target  state. 

An  example  of  the  second  scenario  occurs  when  using 
an  airborne  radar  to  track  ground  targets.  In  the  radar’s 
standard  ground  moving  target  indicator  (GMTI)  mode, 
only  targets  moving  against  the  background  can  be 
observed.  However,  the  radar  may  have  another  mode  such 
as  a  fixed-target  indicator  (FTI)  mode,  with  which  only 
stopped  targets  may  be  observed.  In  order  to  track  the 
targets,  the  radar  must  be  managed  to  periodically  revisit 
targets  in  the  appropriate  mode  to  update  the  estimate  of 
their  position.  Too  long  a  period  without  observing  the 
target  will  lead  to  the  tracking  system  dropping  the  track. 
Longer  track  lifetimes  are  desirable.  The  sensor 
management  problem  is  thus  one  of  selecting  the  sequence 
of  targets  at  which  to  look  with  the  radar  as  well  as  the 
mode  to  use.  One  source  of  complexity  in  the  problem  is 
that  targets  may  not  be  detected  even  if  the  appropriate 
mode  is  used.  Thus,  the  sensor  management  policy  must 
appropriately  hedge  to  select  the  best  mode  based  on  past 
detections.  Another  potential  source  of  complexity  in  the 
problem  is  that  the  measurements  are  taken  over  different 
durations Δ_y in the different modes. Thus, the policy must
appropriately  hedge  in  time  so  that  longer  duration  modes 
are  not  chosen  at  poor  instants  in  time.  To  address  these 
two  issues,  we  have  developed  a  limited  lookahead  policy, 
of  the  form  described  by  (17)  and  (18).  The  policy  allows 
one  to  account  for  past  detections  as  well  as  for  predictions 
of  future  rewards  that  depend  on  the  different  measurement 
durations  in  each  sensor  mode.  Initial  results  of 
performance  are  illustrated  in  Fig.  2.  Here,  a  simple 
simulation  is  used  to  compute  the  average  track  lifetime  for 
two  different  sensor  policies.  One  is  the  limited  lookahead 
policy;  the  other  is  a  policy  that  only  uses  the  GMTI  mode. 
The  simulation  includes  synthetic  target  motion,  a  simple 
tracker, and a simple sensor model. For this sensor model,
the measurement durations are the same for the two
different modes. The results indicate that constructing a
sensor policy that takes advantage of the FTI sensor mode
has the potential to provide significant improvements in
track lifetime. More realistic simulations would be required
to determine the precise benefit.

[Figure: "Performance Gain from Managing Sensor Modes"; horizontal axis: Fraction of Time the Target Is Stopped]

Fig. 2. The curve plots the fractional increase in track lifetime from using a
limited lookahead sensor policy that controls sensor mode over a simple
single-mode policy. In this example, a different sensor mode must be used
to observe the target when it is stopped than when it is moving. However,
the sensor will only detect a target in the proper mode with some
probability less than 1.
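
The hedging on past detections described above can be illustrated with a Bayesian update of the probability that a target is stopped. The following is a minimal sketch, not the limited lookahead policy of (17) and (18). It assumes, for illustration only, that a stopped target is never detected in GMTI mode and that a moving target is never detected in FTI mode; the function names and the parameters `pd_gmti` and `pd_fti` are hypothetical.

```python
def update_p_stopped_after_gmti_miss(p_stopped, pd_gmti):
    """Posterior probability that the target is stopped after one missed GMTI look.

    Illustrative assumption: a stopped target is never detected in GMTI mode,
    and a moving target is detected with probability pd_gmti.
    """
    miss_if_stopped = 1.0             # stopped targets are invisible to GMTI
    miss_if_moving = 1.0 - pd_gmti    # moving targets can still be missed
    evidence = p_stopped * miss_if_stopped + (1.0 - p_stopped) * miss_if_moving
    if evidence == 0.0:
        raise ValueError("a missed detection is impossible under these inputs")
    return p_stopped * miss_if_stopped / evidence


def choose_mode(p_stopped, pd_gmti, pd_fti):
    """Pick the mode with the higher probability of detecting the target."""
    p_detect_gmti = (1.0 - p_stopped) * pd_gmti   # target moving and detected
    p_detect_fti = p_stopped * pd_fti             # target stopped and detected
    return "GMTI" if p_detect_gmti >= p_detect_fti else "FTI"
```

Under these assumptions, each missed GMTI look raises the stopped probability, so a run of misses eventually tips the choice toward the FTI mode.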

The  other  example  of  a  scenario  for  which  we  have 
noted  benefits  of  sensor  management  is  one  in  which  there 
are  differentiated  targets.  Specifically,  a  subset  of  the 
tracked  targets  is  designated  by  a  user  to  be  higher  priority 
than  the  others.  The  high-priority  targets  could  have 
different  tracking  requirements  than  the  low-priority 
targets.  For  example,  they  may  have  more  stringent  track 
accuracy  requirements.  The  specific  context  considered 
here  is  air-to-ground  tracking  with  a  GMTI  radar.  Thus, 
there  is  no  mode  selection  problem  for  the  sensor  manager, 
as  in  the  previously  discussed  scenario.  However,  the 
problem  of  selecting  the  sequence  of  targets  at  which  to 
look is more complex. The sensor management policy must
account for the different numbers of high- and low-priority
targets, the different tracking requirements, and the current
state of tracks to produce a control sequence that yields
measurements of targets to meet the tracking requirements.
Some  initial,  simple  simulations  indicate  that  significant 
benefits  can  be  realized  from  a  good  sensor  management 
policy. In particular, we simulated a scenario with a
high-priority target and several low-priority targets. Two limited
lookahead sensor management policies were evaluated.
One used one step of lookahead (N=1), and the other used
two steps (N=2). Both policies performed equally well on
the high-priority target. However, the two-step lookahead
policy achieved track accuracy requirements on the
low-priority targets 86% more of the time than the one-step
lookahead policy. This suggests that significant benefits
can be realized by appropriately managing the sensor to
track differentiated targets. We are currently planning to
evaluate the benefits in this type of scenario with a more
realistic simulation.

B.  Control  of  Multiple,  Distributed  Sensors 

The  second  example  differs  from  the  previous  one  in  two 
key  respects.  The  first  is  a  decomposition  of  the  sensor 
resource  management  function  into  two  parts:  an 
information  valuation  step  followed  by  a  sensor  allocation 
calculation  (i.e.,  constructing  a  sensor  scheduling  plan  that 
maximizes  the  value  of  collected  information  subject  to 
constraints  on  sensor  availability  and  routing).  The  second 
is  the  introduction  of  multiple  sensors  into  the  problem.  In 
this  example,  we  focus  on  the  information  valuation  aspect 
for  multiple  sensors  of  differing  capability. 

As  described  above,  the  fusion  state  is  determined  by 
both  the  stochastic  evolution  of  the  real  system  and  the 
stochastic  results  of  sensor  measurements  of  that  system. 
Different  sensor  tasking  choices  will  thus  result  in  different 
evolutions  of  the  system’s  state.  The  decision  becomes  one 
of  determining  the  optimal  valuation  of  sensor  resources 
with  respect  to  their  impact  on  the  fusion  process.  While 
there  are  multiple  reasons  for  requesting  particular  sensor 
tasks,  the  approach  described  herein  addresses  an  important 
subset  -  requesting  sensor  tasks  that  will  either  improve 
target  track  estimates  or  remove  association  ambiguity  in 
the  current  or  near-future  fusion  state.  To  emphasize  this 
aspect  of  the  approach,  the  algorithm  has  been  termed 
FIND  (Fusion  Information  Needs  Determination). 

The goal of the FIND algorithm is to maximize the
time-discounted reward J in (7) for the special case where the
time between control actions Δ_u is constant so that one can
rewrite it for a constant α as

\[ J = \sum_{t=0}^{\infty} \alpha^{t} R(I(t)). \tag{20} \]

The reward function R has the form

\[ R(I(t)) = \sum_{j} R_j\bigl(I_j(t) \cup I_u(t)\bigr) \tag{21} \]

where  the  index  j  in  this  case  ranges  over  the  hypothesis 
space  of  the  fusion  system.  The  track  hypothesis  space 
contains  information  about  the  relative  certainty  of 
different  data  associations  that  are  not  reflected  in  the 
single  global  set  of  track  estimates  normally  output.  The 
individual  rewards  Rj  are  a  function  of  a  set  of  goals  and 
priorities,  specifically: 

1. The required kinematic accuracy (expressed as
tracking uncertainty, Σ_Goal) for tracking confirmed
targets

2. The required classification accuracy (probability of
correct classification, P_Goal) for declaring high-confidence
identification of a target

3. The relative priorities for meeting the kinematic and
classification accuracy goals, both singly and in
combination, for each of the expected target types

4. Indications of time-criticality of the information need.

Given this information, we can specify the reward for a
given hypothesis H_j. The reward takes on differing values
depending upon which combination of the goals is
satisfied. For hypothesis H_j, with associated kinematic
uncertainty σ_j²(t) (the maximum eigenvalue of the position
error covariance) and classification probabilities p_j(t)
(defined as the vector of probabilities that the target is of a
given type), the individual reward at time t is given by:


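\[
R_j\bigl(I_j(t) \cup I_u(t)\bigr) =
\begin{cases}
0 & \sigma_j^2(t) > \Sigma_{Goal} \text{ and } \max(p_j(t)) < P_{Goal} \\
R_{\sigma} & \sigma_j^2(t) \le \Sigma_{Goal} \text{ and } \max(p_j(t)) < P_{Goal} \\
R_{p} & \sigma_j^2(t) > \Sigma_{Goal} \text{ and } \max(p_j(t)) \ge P_{Goal} \\
R_{\sigma p} & \sigma_j^2(t) \le \Sigma_{Goal} \text{ and } \max(p_j(t)) \ge P_{Goal}
\end{cases}
\]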
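
The four-case reward above can be sketched as a small function. This is a minimal illustration, not the FIND implementation: the function name and the three reward parameters `r_kinematic`, `r_classification`, and `r_both` are hypothetical stand-ins for the priority-dependent reward values, and the handling of exact equality with a goal is an assumption.

```python
def hypothesis_reward(sigma2, class_probs, sigma_goal, p_goal,
                      r_kinematic, r_classification, r_both):
    """Reward for one hypothesis, depending on which goals are currently met.

    sigma2      : kinematic uncertainty (max eigenvalue of position covariance)
    class_probs : sequence of probabilities that the target is of each type
    """
    kinematic_met = sigma2 <= sigma_goal                # accuracy goal satisfied
    classification_met = max(class_probs) >= p_goal     # confident identification
    if kinematic_met and classification_met:
        return r_both
    if kinematic_met:
        return r_kinematic
    if classification_met:
        return r_classification
    return 0.0                                          # neither goal is met
```

For example, a hypothesis with a tight covariance but an ambiguous type collects only the kinematic reward.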


Different candidate sensor tasks are valued using a 1-step
limited lookahead approach given by (17) and (18). A
heuristic approximation of the terminal reward is given by
the separable function

\[
J(I(t)) = \sum_{j} \left( p_j^{T} P\, p_j + R_{\sigma} \left[ \frac{1}{\pi} \arctan\!\left( \frac{\Sigma_{Goal} - \sigma_j^{2}}{\Gamma} \right) + 0.5 \right] \right) \tag{22}
\]


where the summation is over the different hypotheses
within the track hypothesis space. The FIND values are
computed as the increment in the expected reward of the
one-step lookahead for a set of candidate sensor tasks

\[
\underbrace{v(I(t), u(t), \Delta_u)}_{\text{FIND value}} =
\underbrace{E\bigl[ J(I(t) \cup y(t + \Delta_u) \cup u(t)) \bigr] - J(I(t))}_{\text{Incremental reward}} \tag{23}
\]


Since  FIND  does  not  have  information  as  to  which  specific 
sensors  are  available,  the  FIND  value  is  computed  for  a  set 
of  candidate  sensor  controls  u  parameterized  by  hypothesis 
as  well  as  a  range  of  kinematic  measurement  accuracies 
and  classification  abilities. 
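
The incremental valuation in (23) can be sketched as an expectation over measurement outcomes minus the value of the current state. This is a minimal sketch under assumptions: each candidate task is described by a hypothetical list of (probability, value-after-outcome) pairs, where the values would come from a terminal-reward approximation such as (22); the function names are illustrative.

```python
def find_value(current_value, outcomes):
    """Expected value after a task minus the value of the current state.

    outcomes: list of (probability, value_after_outcome) pairs that
    enumerate the possible measurement results of one candidate task.
    """
    total_prob = sum(p for p, _ in outcomes)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    expected_value = sum(p * v for p, v in outcomes)
    return expected_value - current_value


def rank_tasks(current_value, candidate_tasks):
    """Return (task, FIND value) pairs sorted from most to least valuable."""
    values = {name: find_value(current_value, outcomes)
              for name, outcomes in candidate_tasks.items()}
    return sorted(values.items(), key=lambda item: item[1], reverse=True)
```

A rank-ordered list of this kind is what the text describes passing, after filtering, to the sensor allocation calculation.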

The  FIND  valuation  is  used  to  determine  the  benefit 
derived  from  tasking  a  sensor  to  provide  information  on  a 
specified  hypothesis.  In  practice,  these  valuations  are  rank- 
ordered  and  filtered  such  that  only  a  subset  of  the  possible 
hypotheses  is  considered  in  the  sensor  allocation 
calculation.  This  portion  of  the  solution  balances  the  set  of 
valuations  (which  vary  with  sensor  performance)  against 
competing  requirements  (e.g.,  requests  produced  at  a  higher 
command  level)  to  produce  a  multiple  sensor  tasking  plan. 

To illustrate the performance of the FIND algorithm and
demonstrate its utility for identifying (and quantifying) the
benefits of candidate sensor taskings, consider the following
simple scenario. It begins with a single, stationary, high-priority
target. Initial information about the target consists of good
classification, but poor kinematic information. A short time
later, two distinct tracks are reported by an MTI system.
While these reports provide good kinematic information,
target classification knowledge is poor. The problem
becomes one of identifying which, if either, of the moving
targets is the original high-priority one.

Three  approaches  for  generating  sensor  task  valuations 
were  examined: 

1. Raster, which simply tasks the sensor(s) to address
each hypothesis in turn

2. Myopic, which implements a 1-step lookahead, greedy
approach and defaults to Raster if no sensor task is
expected to achieve a goal

3. FIND, which implements a 1-step lookahead and uses
the heuristic terminal reward in (22) to approximate
the long-time reward.

Two  sensors  are  available  for  tasking.  Nominally,  the  first 
provides  accurate  kinematic  information  but  no 
classification  data,  while  the  second  provides  classification 
information  but  has  a  poorer  (i.e.,  larger)  kinematic 
uncertainty.  Each  is  assumed  to  report  the  results  of  the 
tasked  observation.  The  FIND  problem  is  to  produce  a 
sensor  tasking  (or  set  of  sensor  tasks)  that  resolves  the 
inherent  ambiguities  in  the  hypothesis  space  while 
minimizing  the  number  of  such  tasks.  This  is  equivalent  to 
producing  a  set  of  recommended  sensor  taskings  that 
results  in  the  best  (minimum  number  of  observations) 
solution  to  achieving  the  target  tracking  and  classification 
goals. 

Fig.  3  illustrates  how  the  above  algorithms  perform  for 
one  set  of  evaluation  conditions.  The  curves  are  the 
probability  that  the  tracking  and  classification  goals  are 
exceeded  as  a  function  of  the  number  of  recommended 
sensor  taskings.  The  results  shown  in  the  figure  are 
representative;  the  FIND  algorithm  is  clearly  superior  to 
the  other  approaches. 

The  valuations  provided  by  the  FIND  algorithm  can  be 
viewed  as  providing  different  types  of  requests  to  improve 
fusion  performance.  The  highest  value  requests  are  those 
which  remove  ambiguity  in  report  associations  to  high 
priority  targets.  Requests  that  confirm  ID  and  track  likely 
high  priority  targets  typically  have  medium  values,  while 
those  with  the  lowest  value  are  requests  to  ID  unknown 
targets  and  are  usually  ignored  unless  no  higher  value  tasks 
are  requested  for  a  given  sensor  resource. 

Fig. 3. FIND significantly reduces the number of sensor taskings required
to achieve performance goals.

IV. Conclusion

The examples in the previous section highlight issues in
sensor management and indicate how one could exploit
structure resulting from independent target motions to
develop a sensor management policy. The results indicate
that such policies will appropriately allocate sensor
resources to improve the resolution of hypotheses in
multi-target tracking systems and, specifically, to improve the
surveillance of high-priority targets. Further
experimentation is required to determine the precise degree
to which benefits can be realized in practice. Planned
development of high fidelity simulations will allow us to
perform the necessary experiments. We expect results will
indeed confirm that significant benefits can be realized.

Abstract - We consider the sensor management problem
arising  in  using  a  multi-mode  sensor  to  track  moving  and 
stopped  targets.  The  sensor  management  problem  is  to 
determine  what  measurements  to  take  in  time  so  as  to 
optimize  the  utility  of  the  collected  data.  Finding  the  best 
sequence  of  measurements  is  a  hard  combinatorial 
problem  due  to  many  factors,  including  the  large  number 
of  possible  sensor  actions  and  the  complexity  of  the 
dynamics.  The  complexity  of  the  dynamics  is  due  in  part 
to  the  sensor  dwell-time  depending  on  the  sensor  mode, 
targets  randomly  starting  and  stopping,  and  the 
uncertainty  in  the  sensor  detection  process.  For  such  a 
sensor  management  problem,  we  propose  a  novel, 
computationally  efficient,  farsighted  algorithm  based  on 
an  approximate  dynamic  programming  methodology.  The 
algorithm’s  complexity  is  linear  in  the  number  of  targets. 
We  evaluate  this  algorithm  against  a  myopic  algorithm 
optimizing  an  information-theoretic  scoring  criterion. 
Our  simulation  results  indicate  that  the  farsighted 
algorithm  performs  better  with  respect  to  the  average 
time  the  track  error  is  below  a  specified  goal  value. 

Keywords:  Tracking,  sensor  management,  farsighted 
strategy,  stochastic  dynamic  programming. 

1  Introduction 

We  consider  the  problem  of  sensor  management  arising  in 
tracking  multiple  targets  with  a  multi-mode  sensor.  The 
sensor  management  problem  is  to  determine  which  target 
to  be  observed  by  the  sensor  and  which  sensor  mode  to 
use.  These  decisions  are  to  be  made  over  time  so  as  to 
optimize  the  utility  of  the  collected  measurements.  An 
example  of  the  sensor  management  problem  of  interest  is 
that  of  managing  a  multi-mode  airborne  radar  to  track 
moving  and  stopped  ground  targets.  Specifically,  the 
radar  may  have  two  modes,  a  moving  target  indicator 
(MTI)  mode  for  observing  moving  targets  and  a  fixed- 
target  indicator  (FTI)  mode  for  observing  stopped  targets. 
Each  mode  is  characterized  by  the  uncertainties  in  the 
target  detection  and  measurement  processes  as  well  as  the 


measurement  collection  time.  These  may  all  be  different  in 
the  different  modes.  The  radar  controls  include  which 
mode  to  use  as  well  as  where  to  point  the  radar  for  a 
particular  measurement.  The  objective  is  to  collect  enough 
data  on  target  position  over  time  to  meet  desired  track 
error goals. Determining how to optimally manage a
sensor  over  time  to  meet  such  objectives  is  a  hard 
combinatorial  optimization  problem.  The  problem 
complexity  stems  from  many  factors,  including  the  large 
number  of  possible  sensor  actions  as  well  as  the 
complexity  of  the  dynamics.  The  complexity  of  the 
dynamics  results  partially  from  the  radar  dwell-time 
depending  on  the  radar  mode,  targets  randomly  starting 
and  stopping,  and  the  uncertainty  in  the  sensor  detection 
process.  As  a  result  of  these  factors,  computing  optimal 
sensor  management  strategies  is  often  infeasible. 

Various  strategies  have  been  proposed  for  sensor 
management including strategies that use information-theoretic
scoring criteria such as those developed in [1]
for  tracking,  and  [2]  and  [3]  for  collaboration  of  networked 
sensors.  Some  farsighted  strategies  have  been  developed 
in  [4],  [5],  and  [6]  for  tracking.  More  specifically,  the  work 
in  [6]  evaluates  some  farsighted  strategies  and  compares 
them  to  a  myopic  strategy.  A  farsighted  strategy  is  one 
where  the  sensor  manager  considers  the  benefits  resulting 
from  a  sequence  of  (two  or  more)  sensor  actions,  while  a 
myopic  strategy  is  one  where  the  sensor  manager 
considers  only  the  benefits  resulting  from  a  single  sensor 
action.  In  [6],  the  evaluations  of  these  two  kinds  of 
strategies  are  performed  on  the  problem  of  managing  a 
single  mode  sensor  to  track  targets  that  may  become 
occluded. 

The work presented in this paper is also motivated
by an interest in comparing the benefits of farsighted
strategies with those of myopic strategies. In contrast to
previous work, the investigation presented here considers
a novel farsighted algorithm and problem for evaluation.
In particular, we consider the problem of move/stop
tracking outlined above, which has not previously been
considered for evaluating the relative benefits of
farsighted sensor management strategies. For this
problem, we develop a novel, efficiently computable
farsighted sensor management strategy.

In  the  remainder  of  this  paper,  we  present  the 
algorithms  for  evaluation  and  the  results  of  that 
evaluation.  In  Section  2,  we  describe  the  farsighted  sensor 
management  strategy,  which  is  based  on  an  approximate 
stochastic  dynamic  programming  methodology.  In  Section 
3,  we  introduce  a  myopic  sensor  management  algorithm 
that  serves  as  a  baseline  for  the  evaluation  of  the 
farsighted sensor management algorithm. In Section 4, we
report  some  simulation  results  comparing  the  two 
algorithms.  In  Section  5,  we  give  some  concluding  remarks. 

2  Farsighted  Strategy 

Our  development  of  the  farsighted  strategy  is  based  on  a 
formulation  of  the  sensor  management  problem  of  interest 
as  a  stochastic  dynamic  program.  This  formulation  is 
discussed  in  the  following  section. 


2.1  Formulation As a Dynamic Program

We formulate our sensor management problem as a
continuous-time stochastic dynamic programming problem
with an infinite horizon (cf. [7]). The system to be
controlled is the tracker, whose state x(t) at time t is
composed of the target track states x_i(t), i.e.,

\[ x(t) = (x_1(t), \ldots, x_n(t)), \]

where n is the number of targets. The target track state x_i(t)
is given by

\[ x_i(t) = (S_i(t), p_{im}(t), p_{is}(t)), \]

where S_i(t) is the position-error covariance for target i, and
p_im(t) and p_is(t) are the probabilities that the target is
moving or is stopped at time t, respectively. These
probabilities can be estimated, for example, by computing
the target-mode likelihoods based on the target detection
history.

The candidate controls include the location of a
sensor dwell and the mode of the sensor. For clarity of
exposition, we assume that the sensor may observe only
one target at a time. In particular, we say that the sensor
observes target i when the sensor is actually pointed at
the estimated position of target i. Therefore, at any state
and time, the set of candidate controls U consists of
target-radar mode pairs, i.e.,

\[ U = \{ (i, \mu) \mid i \in \{1, \ldots, n\},\ \mu \in \{\mathrm{MTI}, \mathrm{FTI}\} \}. \]

This set can be modified to include other candidate
locations to accommodate a more general case when the
sensor may observe more than one target at a time.

Given that a control u(t_k) has been selected at time t_k,
the new state of the system depends on the control u(t_k)
and the resulting measurement outcome. The dynamics of
the state are given by a track filter, such as an interacting
multiple-model (IMM) filter (see [8]). To simplify the
notation, for the target-state and system-state evolution,
respectively, we will write

\[ x_i(t) = f(x_i(t_k), u(t_k), w, t - t_k), \]

\[ x(t) = F(x(t_k), u(t_k), w, t - t_k), \]

where w is the detection outcome with w=1 or w=0
indicating detection and missed detection, respectively.

Now, we define a reward r_i for each target i, which is
nonzero only when the target-track error is sufficiently
small. In particular, for a specified track error goal G_i
associated with target i, the reward r_i is nonzero if the
mean-squared track error, the trace of the error covariance,
is below G_i. Otherwise, r_i is zero. Formally,

\[
r_i(x_i(t)) =
\begin{cases}
V_i & \text{if } \mathrm{Tr}(S_i(t)) < G_i, \\
0 & \text{otherwise,}
\end{cases} \tag{1}
\]

where V_i is the priority value for target i and Tr(·) denotes
the trace. In essence, the reward is collected only when the
track accuracy is high enough. The total reward R at state
x(t) is the sum of the target rewards,

\[ R(x(t)) = \sum_{i=1}^{n} r_i(x_i(t)). \]

The sensor manager goal is to select the sequence of
controls {u(t_k) | k = 0, 1, ...} maximizing the expected reward
accumulated over time. Formally, the sensor manager is
solving the following problem

\[
\max_{\{u(t_k) \mid k = 0, 1, \ldots\}} E\left[ \int_0^{\infty} R(x(t))\, e^{-\nu t}\, dt \;\middle|\; x(0) = x_0 \right],
\]

where ν is a decay factor modeling the information
depreciation in time and x_0 is the initial state of the system.
According to dynamic programming theory, there is
an optimal policy π specifying an optimal control for each
state. Furthermore, every optimal policy satisfies Bellman's
equation:


\[
J^*(x) = \max_{u \in U} \left\{ \int_0^{t_u} R(x(t))\, e^{-\nu t}\, dt + e^{-\nu t_u} E_w\bigl[ J^*(F(x, u, w, t_u)) \bigr] \right\},
\]

where t_u is the time needed to complete the sensor task
specified by a control u, and J*(x) is the expected reward
accumulated under the optimal policy π when the system
starts at state x. Note that, to find an optimal policy using
Bellman's  equation,  the  equation  has  to  be  solved  for  all 
states  x.  When  the  number  of  states  is  large  or  the  number 
of  control  choices  is  large,  solving  Bellman’s  equation  is 
computationally  prohibitive  due  to  the  combinatorial 
explosion  of  state-control  combinations.  This  is  exactly  the 
case  with  our  sensor  management  problem,  where  the 
computational  complexity  grows  exponentially  with  the 
number  of  targets,  the  number  of  target  track  states,  the 
number  of  sensor  modes,  and  the  number  of  sensor  dwells. 
For  example,  the  number  of  target  track  states  grows 
exponentially  with  the  number  of  sensor  dwells,  and  can 
be  large  even  for  a  single  target.  Many  different 
techniques  for  generating  suboptimal  policies  have  been 
developed.  We  consider  a  rollout-based  approach,  as 
discussed  in  the  following  section. 
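
The reward structure of this section can be sketched in a few lines. This is a minimal illustration under assumptions: covariances are plain nested lists, the integral in the objective is approximated by a Riemann sum on a uniform time grid, and all function and parameter names (including `decay` for the decay factor) are hypothetical.

```python
import math


def trace(cov):
    """Trace of a square matrix given as a list of rows."""
    return sum(cov[k][k] for k in range(len(cov)))


def target_reward(cov, goal, priority):
    """Reward of one target: its priority if the track error is below the goal."""
    return priority if trace(cov) < goal else 0.0


def total_reward(covs, goals, priorities):
    """Sum of the per-target rewards at one instant."""
    return sum(target_reward(c, g, v) for c, g, v in zip(covs, goals, priorities))


def discounted_reward(reward_samples, decay, dt):
    """Riemann-sum approximation of the integral of R(x(t)) exp(-decay * t) dt,
    with reward_samples[k] taken at time k * dt."""
    return sum(r * math.exp(-decay * k * dt) * dt
               for k, r in enumerate(reward_samples))
```

Evaluating `discounted_reward` for every control sequence is exactly what becomes infeasible as the number of targets and dwells grows, which motivates the rollout approximation that follows.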

2.2  Rollout  Strategy 

The rollout strategy is an approximate dynamic
programming technique (see [7], Vol. 1, p. 314). This
technique evaluates a control action by estimating near-
and far-future benefits resulting from the control choice at
the current state. The near-future benefits are computed
by predicting the action consequences over the look-ahead
planning stages. The far-future benefits are the
benefits accumulated after the look-ahead stages. In the
rollout approach, the far-future benefits are computed as
the benefits resulting from applying a fixed policy. This
approach is illustrated in Figure 1.


Figure 1. The rollout approach evaluates near- and
far-future benefits of an action taken at the current state.

Formally speaking, the rollout approach is one
policy-iteration step performed on some fixed policy π. In
particular, for a fixed policy π, one step of the policy
iteration consists of solving the following problem:

\[
\max_{u \in U} \left\{ \int_0^{t_u} R(x(t))\, e^{-\nu t}\, dt + e^{-\nu t_u} E_w\bigl[ J_\pi(F(x, u, w, t_u)) \bigr] \right\}, \tag{2}
\]

where J_π is the expected reward of policy π. The resulting
overall policy is a one-step improvement of the original
policy π. Thus, it is desirable that π is near-optimal.
Furthermore, it is also desirable to select π so that the
expected policy-reward J_π is computable.

In the rollout-based algorithm, the sensor manager
selects a sensor action by solving (2). As a first step
toward solving this maximization problem, we discretize the
time by letting

\[ \tau_k = k\delta \quad \text{for } \delta > 0 \text{ and } k = 0, 1, 2, \ldots. \]

Then, relation (2) reduces to

\[
\max_{u \in U} \left\{ \sum_{k=0}^{K_u - 1} \alpha^{k} R(x(\tau_k)) + \alpha^{K_u} E_w\bigl[ J_\pi(F(x, u, w, \tau_{K_u})) \bigr] \right\}, \tag{3}
\]

where K_u is the sensor dwell-time (in units of δ) for the
sensor action specified by control u, and α ∈ (0,1) is a
discount factor given by

\[ \alpha = e^{-\nu\delta}. \]

Recall that for any control choice u = (j, μ), the detection
outcome w can take value 1 or 0. Thus, the expectation in
relation (3) reduces to

\[
E_w\bigl[ J_\pi(F(x, u, w, \tau_{K_u})) \bigr] =
p_{j\mu}(\tau_{K_u})\, P_{j\mu}\, J_\pi(F(x, u, 1, \tau_{K_u}))
+ \bigl[ 1 - p_{j\mu}(\tau_{K_u})\, P_{j\mu} \bigr] J_\pi(F(x, u, 0, \tau_{K_u})),
\]

where p_{jμ}(τ_{K_u}) is the probability that target j is in mode μ
at the time of the observation, and P_{jμ} is the detection
probability for radar mode μ when observing target j.
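
The discretized rollout score and the two-outcome expectation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the base-policy rewards after a detection (`j_detect`) and after a miss (`j_miss`) are assumed to be supplied by the caller, the near-term stage rewards are passed in as a list whose length plays the role of the dwell time, and every function and parameter name is hypothetical.

```python
import math


def discount_factor(decay, delta):
    """Per-step discount for decay rate `decay` and time step `delta`."""
    return math.exp(-decay * delta)


def expected_base_policy_reward(p_mode, p_detect, j_detect, j_miss):
    """Average the base-policy reward over the detection outcome w in {1, 0}."""
    p_success = p_mode * p_detect       # target is in the observed mode and detected
    return p_success * j_detect + (1.0 - p_success) * j_miss


def rollout_score(stage_rewards, alpha, p_mode, p_detect, j_detect, j_miss):
    """Near-term discounted reward during the dwell plus discounted far-term reward."""
    dwell = len(stage_rewards)          # dwell time in units of the time step
    near = sum(alpha ** k * r for k, r in enumerate(stage_rewards))
    far = alpha ** dwell * expected_base_policy_reward(
        p_mode, p_detect, j_detect, j_miss)
    return near + far


def select_control(candidates, alpha):
    """candidates maps a control (target, mode) to the rollout_score arguments."""
    return max(candidates, key=lambda u: rollout_score(alpha=alpha, **candidates[u]))
```

A longer dwell collects more near-term stages but discounts the far-term reward more heavily, which is the trade-off the farsighted policy weighs when the modes have different dwell times.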

To  accommodate  efficient  computation  of  the 
expected  policy-reward,  our  sensor  resource  manager  uses 
a  tracker  predictive  model  approximating  the  tracker.  This 
model  is  based  on  the  following: 

Assumption: 

1.  Each  target  is  either  moving  or  is  stopped,  but  the 
target  motion  state  is  unknown. 

2.  A  target  track  is  dropped  if  the  target  is  not  detected. 

Assumption  1  is  realistic  for  cases  where  the  changes  in 
target  motion  take  longer  than  planning  and  executing  a 
sensor  action.  Assumption  2  is  more  conservative  than 
necessary  (a  target  track  may  continue  even  with  one  or 
more  missed  detections).  However,  the  resulting  model  is 
useful  for  planning  purposes.  Furthermore,  these 
assumptions  restrict  the  branching  of  the  control-outcome 
space  of  any  policy.  This  allows  us  to  evaluate  our 
farsighted  policy  without  using  costly  Monte  Carlo 
simulations. 

We now discuss the choice of policy π. Motivated
by the desire to have a good policy whose expected
reward can be computed for any initial state, we consider a
policy π having the following properties:

1. A target is observed with either MTI or FTI mode at all
times.

2. Initially, the targets are sorted in a list according to
some criterion. Then, these targets are observed
according to the list as follows: each target is observed
until either its track error decreases below the desired
value or its track is dropped. If the track error is
decreased below the desired value, the target is
revisited at the rate that keeps its track error below the
desired value.

We assume that the sensor can revisit the targets with
rates that keep the track errors below the desired values.

We next describe a procedure for selecting a radar
mode and a procedure for ordering the targets in a list. In
particular, let x be the system state when the policy π is to
be applied, i.e.,

\[ x = (x_1, \ldots, x_n) \quad \text{with} \quad x_i = (S_i, p_{im}, p_{is}),\ i = 1, \ldots, n. \]

From the current target-mode probabilities p_im and p_is, we
determine the most likely target modes

\[ \mu(i) = \arg\max \{ p_{im}, p_{is} \}. \]

Under the policy π, the radar always uses mode μ(i) when
observing target i. Given the current target covariances S_i,
i = 1,...,n, the policy π sorts the targets according to the
vicinity of their track errors to the desired goal, i.e., the
targets are sorted according to the values Tr(S_i)/G_i for
i = 1,...,n. This order is motivated by that generated
according to an index-rule policy such as that discussed in
[5].
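
The mode selection and target ordering of the base policy can be sketched as below. This is a minimal sketch under assumptions: the ordering is taken to be ascending in Tr(S_i)/G_i (so targets already within their goals come first, as the subsequent derivation assumes), ties in mode probability are broken in favor of MTI, and the function names are hypothetical.

```python
def most_likely_mode(p_moving, p_stopped):
    """Radar mode matched to the more probable target motion state."""
    return "MTI" if p_moving >= p_stopped else "FTI"


def order_targets(traces, goals):
    """Indices of targets sorted by the ratio of track error to goal.

    traces[i] is Tr(S_i) and goals[i] is G_i; a ratio below 1 means the
    target is already within its goal region.
    """
    ratios = [t / g for t, g in zip(traces, goals)]
    return sorted(range(len(ratios)), key=lambda i: ratios[i])
```

For instance, with traces `[8.0, 1.0, 3.0]` and goals `[4.0, 2.0, 2.0]`, the ratios are 2.0, 0.5, and 1.5, so the visiting order is target 1, then target 2, then target 0 (zero-based indices).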

We now evaluate the policy reward J_π. We note that,
even for our simple tracker predictive model, it is still
computationally prohibitive to exactly evaluate the reward
J_π due to coupling of the target dynamics under the policy.
We thus approximate the policy reward J_π assuming that it
is separable across the targets, i.e.,

\[ J_\pi(S) = \sum_{i=1}^{n} J_i(S) \quad \text{with} \quad S = (S_1, \ldots, S_n). \]

For notational convenience, without loss of generality, we
may assume that the order of the targets is 1, 2, ..., n when
they are sorted according to values Tr(S_i)/G_i. Suppose
that the mean-squared errors for the first κ targets are
within their goal region, i.e.,

\[ \mathrm{Tr}(S_i) < G_i \quad \text{for } i = 1, \ldots, \kappa. \]

For these targets, the reward J_i(S) is the reward collected
during the revisit period, i.e.,

\[ J_i(S) = L_i \quad \text{for } i = 1, \ldots, \kappa, \]

where L_i is the long-term reward accumulated during target
revisits (to be discussed shortly).

Consider now the targets i = κ+1,...,n. Their
mean-squared errors exceed the desired goals. Under the policy
π (property 2), each of these targets is observed until the
track is dropped or its mean-squared error decreases below
the value G_i. However, it can be seen that the accumulated
reward is nonzero only if the mean-squared error decreases
below the value G_i. Let T_{κ+1} be the observation time
required to decrease the trace Tr(S_{κ+1}) below value G_{κ+1}.
Then, we have

\[ J_{\kappa+1}(S) = \bigl( \alpha P_{\kappa+1,\mu(\kappa+1)} \bigr)^{T_{\kappa+1}} L_{\kappa+1}. \]

While observing target κ+1, the covariances of targets
κ+2,...,n evolve to S_{κ+2}(T_{κ+1}),...,S_n(T_{κ+1}). Let T_{κ+2} be the
observation time required to decrease the trace of
covariance S_{κ+2}(T_{κ+1}) below value G_{κ+2}. Then, we have

\[ J_{\kappa+2}(S) = \bigl( \alpha P_{\kappa+2,\mu(\kappa+2)} \bigr)^{T_{\kappa+1} + T_{\kappa+2}} L_{\kappa+2}. \]

Continuing in this manner, we can see that

\[ J_i(S) = \bigl( \alpha P_{i,\mu(i)} \bigr)^{T_{\kappa+1} + T_{\kappa+2} + \cdots + T_i} L_i \quad \text{for } i = \kappa+1, \ldots, n, \]

where T_i is the time required to decrease the trace of
S_i(T_{κ+1}+...+T_{i-1}) below goal value G_i.

\[ \Gamma(jM) = \rho \bigl( 1 + \alpha^{M} + \cdots + \alpha^{(j-1)M} \bigr) = \rho\, \frac{1 - \alpha^{jM}}{1 - \alpha^{M}}, \]

where ρ is the reward accumulated between any two
consecutive revisits and is given by

\[ \rho = V_i \bigl( 1 + \alpha + \cdots + \alpha^{M-1} \bigr) = V_i\, \frac{1 - \alpha^{M}}{1 - \alpha}. \]

The long-term reward L_i is equal to the expected reward Γ
collected during the target lifetime, so that by computing
the expected value of Γ(jM), we can see that

We  next  discuss  the  long-term  rewards  L, 
accumulated  during  periodic  revisits  of  the  targets.  As 
mentioned  earlier,  once  the  traces  of  error  covariances  for 
all  targets  decrease  below  their  corresponding  goal  values, 
the  targets  are  revisited  at  a  constant  rate.  Under 
Assumption  1,  the  target  modes  do  not  change  in  time,  so 
the  error  covariance  of  a  stopped  target  does  not  change 
in  time.  Therefore,  the  stopped  targets  with  error  traces 
below  their  corresponding  goals  are  not  revisited,  and  the 
long-term  reward  L,  is  given  by 


^=fa'v(=  — 

h  1-a 


for  a  stopped  target  i. 


We  focus  now  on  the  long-term  reward  L,  associated  with 
a  moving  target  i.  Let  M  be  the  length  of  the  revisit  interval 
required  for  keeping  the  trace  of  the  target  i  error 
covariance  below  the  desired  value.  Without  loss  of 
generality  we  may  assume  that  the  sensor  revisits  the 
target  i  at  times  t=jM,j=  0,1,....  During  the  revisit  process, 
the  reward  is  collected  for  as  long  as  the  target  is  detected, 
and  the  reward  ceases  when  the  target  is  not  detected  (cf. 
Assumption  2).  Thus,  during  the  revisit  stage,  the  reward 
is  collected  per  unit  of  target  lifetime.  The  target  lifetime  is 
a  random  variable  taking  value  jM  with  probability 

fVm'n  (l  — )  '  Hence,  f  the  lifetime  takes  value  jM, 
then  the  collected  reward  G (jM)  is  equal  to 


V,(l-a“) 

L  = - - - — - -  for  a  moving  target  i. 

(l-a)(l-a“|3w0) 
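The closed form for the moving-target reward can be checked numerically. The sketch below assumes a reading of the derivation in which the lifetime equals $jM$ with probability $\beta^{j-1}(1-\beta)$ for $j = 1, 2, \ldots$, where `beta` stands for the per-revisit detection probability; the function names and the truncation length `terms` are illustrative choices, not part of the original report.

```python
def revisit_reward(v, alpha, m):
    """Reward rho collected between two consecutive revisits (m discounted steps)."""
    return v * (1 - alpha**m) / (1 - alpha)


def long_term_reward_closed_form(v, alpha, m, beta):
    """Closed-form long-term reward for a moving target."""
    return v * (1 - alpha**m) / ((1 - alpha) * (1 - alpha**m * beta))


def long_term_reward_series(v, alpha, m, beta, terms=2000):
    """Expected value of G(jM) under an assumed geometric lifetime (truncated sum)."""
    rho = revisit_reward(v, alpha, m)
    total = 0.0
    for j in range(1, terms + 1):
        prob = beta ** (j - 1) * (1 - beta)               # assumed P(lifetime = jM)
        g = rho * (1 - alpha ** (j * m)) / (1 - alpha**m)  # reward G(jM)
        total += prob * g
    return total


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(long_term_reward_closed_form(1.0, 0.95, 5, 0.9))
    print(long_term_reward_series(1.0, 0.95, 5, 0.9))
```

Under these assumptions the truncated series and the closed form agree to numerical precision, which is a quick consistency check on the geometric-sum algebra.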

Concluding this section, we note that the preceding
algorithm description involved only two sensor modes,
mainly for clarity of exposition. The algorithm can be
easily extended to a more general case involving an
arbitrary number of sensor modes. Note that the
preceding procedure for approximating the expected
reward $J^{\pi}$ has polynomial complexity in the number of
targets, in contrast to the exponential complexity required
to compute the reward $J^{\pi}$ exactly.

3  Myopic  Strategy 

Here,  we  present  a  myopic  sensor  management  algorithm 
that  serves  as  a  baseline  for  evaluating  the  performance  of 
the  farsighted  algorithm  discussed  in  the  preceding 
section.  We  do  not  consider  a  myopic  approach  optimizing 
the  dynamic  programming  formulation  presented  in  the 
previous  section.  This  is  because  the  reward  function  has 
a  threshold  structure  so  that  no  value  may  be  realized  from 
a  single  action.  Instead,  we  consider  an  algorithm  that 
evaluates  sensor  actions  based  on  the  expected  decrease 
in  the  entropy  of  the  target-track  errors  per  unit  of  time. 
The  algorithm  is  myopic  since  the  changes  in  the  entropy 
are  computed  only  for  a  single  sensor  action. 

Specifically, the entropy $h_i$ for target $i$ at state
$x_i = (S_i, P_{im}, P_{is})$ is given by

$$h_i(x_i) = \frac{V_i}{2} \log\left[2\pi e\,\mathrm{Tr}(S_i)\right], \qquad (4)$$

where $V_i$ is the priority of target $i$ and $\pi \approx 3.14$ (see [9],
Chapter 9). As seen from this relation, the target entropy is
measured by the target-track error in log-scale. The
entropy $H$ of the system at state $x$ is defined as the sum of
the target entropies $h_i$:

$$H(x) = \sum_{i=1}^{n} h_i(x_i), \quad \text{where } x = (x_1, \ldots, x_n). \qquad (5)$$
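As a small illustration of relations (4) and (5), the following sketch computes the priority-weighted entropy of each target from the trace of its error covariance and sums the results. The function names and sample values are hypothetical, and the natural logarithm is assumed.

```python
import math


def target_entropy(priority, trace):
    """h_i = (V_i / 2) * log(2 * pi * e * Tr(S_i)), natural log assumed."""
    return 0.5 * priority * math.log(2 * math.pi * math.e * trace)


def system_entropy(priorities, traces):
    """H(x): sum of the target entropies over all targets."""
    return sum(target_entropy(v, tr) for v, tr in zip(priorities, traces))


if __name__ == "__main__":
    # Two hypothetical targets with equal priority and different track errors.
    print(system_entropy([1.0, 1.0], [25.0, 100.0]))
```

Because the trace enters through a logarithm, halving a target's track error lowers its entropy by the same amount regardless of the current error level.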


Let $t_k$ be the current time and $x(t_k)$ be the current state.
Given that a control $u$ is selected at time $t_k$, the expected
entropy decrease per unit of time at state $x(t_k)$ is given by

$$D(x(t_k), u) = \frac{E[H(x(t_{k+1}))] - H(x(t_k))}{t_u},$$

where $t_u$ is the time required for completing the sensor
action specified by control $u$, $t_{k+1} = t_k + t_u$ is the time at which the
detection outcome $w$ is available, and $x(t_{k+1})$ is the state to
which the system transitions under the control $u$. From
relations (4) and (5), it can be seen that for state $x(t_k)$ with
components

$$x_i(t_k) = \left(S_i(t_k), P_{im}(t_k), P_{is}(t_k)\right)$$

and for control $u = (j, n)$, the expected entropy decrease is

$$D(x(t_k), u) = \frac{1}{2 t_u}\left[\sum_{i \ne j} V_i \log\frac{\mathrm{Tr}(S_i^{-}(t_{k+1}))}{\mathrm{Tr}(S_i(t_k))} + V_j\, E_w\!\left[\log\frac{\mathrm{Tr}(S_j(t_{k+1}))}{\mathrm{Tr}(S_j(t_k))}\right]\right], \qquad (6)$$

where $S_i^{-}(t_{k+1})$ is the predicted target covariance, and
$S_j(t_{k+1})$ is either the predicted covariance $S_j^{-}(t_{k+1})$ or the
updated covariance $S_j^{+}(t_{k+1})$, depending on the detection
outcome $w$, according to the standard Kalman-filter equations.

The entropy-based sensor manager is myopic: at
state $x(t_k)$, it selects a control $u$ optimizing the entropy
decrease $D$. More specifically, the entropy-based sensor
manager solves the following problem

$$\min_{u \in U} D(x(t_k), u),$$

where $D(x(t_k), u)$ is computed according to equation (6).

4 Numerical Results

Here, we present our simulation scenario and the test
results obtained for the farsighted rollout algorithm and
the myopic entropy-optimizing algorithm. The purpose of
the simulations is to determine if the farsighted algorithm
has any potential advantages over the myopic algorithm.

4.1 Simulation scenario

We consider a 10 minute scenario in which there are 50
targets to be tracked, while they are both moving and
stopped, by a radar having two modes, MTI and FTI. We
assume that each target moves along a one-dimensional
road and has normally distributed velocity with a specified
root mean square value. The target transitions between
moving and stopped are modeled by a discrete-time
Markov chain with two states, moving and stopped, and
with state-dependent transition probabilities, as illustrated
in Figure 2. By varying $P_{sm}$, we simulate the cases where
the average number of targets that are stopped in steady
state is 20, 30, and 40, and we initialize the number of
stopped targets in the scenario to the average steady-state
value.

Figure 2. Discrete-time Markov chain modeling target
motion. At any time, a target is either moving or
stopped. The transitions occur at times $k\delta$, for $k = 1, 2, \ldots$,
where $\delta$ is the time increment.
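For the two-state move/stop Markov chain used in the simulation scenario, the steady-state number of stopped targets follows from the stationary distribution of the chain. The sketch below assumes that `p_sm` denotes the stopped-to-moving transition probability and `p_ms` the moving-to-stopped one; this reading of the subscripts, the function names, and the value of `p_ms` are assumptions made for illustration only.

```python
def stationary_stopped_fraction(p_ms, p_sm):
    """Stationary probability of the 'stopped' state of a two-state chain."""
    return p_ms / (p_ms + p_sm)


def p_sm_for_stopped_count(p_ms, n_targets, n_stopped):
    """Stopped-to-moving probability giving n_stopped targets stopped on average."""
    return p_ms * (n_targets - n_stopped) / n_stopped


if __name__ == "__main__":
    p_ms = 0.1  # hypothetical moving-to-stopped probability per time increment
    for n_stopped in (20, 30, 40):
        p_sm = p_sm_for_stopped_count(p_ms, 50, n_stopped)
        print(n_stopped, p_sm, 50 * stationary_stopped_fraction(p_ms, p_sm))
```

Initializing the scenario with the stationary number of stopped targets, as the text describes, avoids a transient at the start of each run.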


Our  tracker  model  is  based  on  a  simple  IMM  filter  (cf. 
[8])  consisting  of  a  filter  modeling  the  kinematics  of  a 
“moving”  target  and  another  one  modeling  the  kinematics 
of  a  “stopped”  target.  The  target  track  state  consists  of 
the  target  location  estimate,  the  estimate  of  error  variance, 
and  the  target  mode  probabilities.  The  target  mode 
probabilities  are  updated  given  information  on  whether  a 
target  is  detected  or  not.  In  particular,  if  a  target  is  not 
detected  in  MTI  mode,  then  the  probability  that  the  target 
is  moving  decreases.  A  similar  update  results  from  a 
missed  detection  in  FTI  mode.  The  tracker  drops  a  target  if 
its  mean-squared  error  exceeds  a  specified  upper  bound. 
Furthermore,  the  tracker  report  association  is  assumed  to 
be  perfect. 

The MTI radar mode can detect moving targets only,
while the FTI mode can detect stationary targets only.
Each mode is characterized by its detection probability,
measurement error variance, and dwell-time. For both
modes and all targets, the detection probability is 0.9 and
the measurement error variance is 1 m². The MTI mode
dwell-time is 0.1 seconds, and the FTI mode dwell-time is
10 seconds.

We assume that all targets have the same priority
and the same goal values for their error variances. In
particular, in equation (1), we use priority $V_i = 1$ and goal
value $G_i = 25$ m² for all $i$.

4.2  Simulation  results 

In this section, we present the simulation results obtained
for the farsighted sensor management algorithm and the
myopic entropy-optimizing algorithm. We use the average
fraction of time the error goals are met as a measure of
performance. This is computed as the fraction of time the
target error goal is met per target and then averaged over
the number of targets. These average values, obtained for
typical sample paths, are presented in Figure 3. The bars in
the figure mark the standard deviations of performance
over time for each sample path. They give an indication of
the expected variability in performance for different sample
paths provided that the dynamics are ergodic.
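The performance measure described above can be sketched as follows, assuming that each target's error trace is recorded at uniformly spaced time samples and that a goal counts as met when the trace does not exceed it. The data layout and function name are hypothetical.

```python
def fraction_of_time_goals_met(traces, goals):
    """Average over targets of the fraction of samples with Tr(S_i) <= G_i.

    traces[i] is the sequence of error traces of target i at uniformly
    spaced time samples; goals[i] is the goal value G_i.
    """
    per_target = [
        sum(1 for tr in series if tr <= goal) / len(series)
        for series, goal in zip(traces, goals)
    ]
    return sum(per_target) / len(per_target)


if __name__ == "__main__":
    # Two hypothetical targets with goal value 25 each.
    traces = [[10.0, 30.0, 20.0, 40.0], [5.0, 5.0, 5.0, 5.0]]
    print(fraction_of_time_goals_met(traces, [25.0, 25.0]))
```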

[Figure 3 plot. Title: Average Fraction of Time the Target Error Goals are Attained]

Figure 3. As the average number of stopped targets
increases, the average time the error goals are met for the
farsighted algorithm is increasingly better than that of the
myopic algorithm.

The simulation results indicate that the farsighted
sensor management algorithm maintains better quality
target tracks than the myopic algorithm. In particular, the
average time the error goals are met for the farsighted
algorithm is much longer than for the myopic algorithm.
Furthermore, the difference between the average time the
error goals are met for the farsighted algorithm and for the
myopic algorithm increases as the average number of
stopped targets increases. We attribute this to the
capability of the farsighted algorithm to adapt the target
revisit rates appropriately.

In  particular,  the  myopic  algorithm  schedules  the 
longer  FTI  mode  more  frequently  to  observe  the  stopped 
targets,  as  the  number  of  stopped  targets  increases.  This 
results  in  substantially  less  time  being  spent  observing 
moving  targets,  and  the  corresponding  track  errors  far 
exceed  the  goals.  In  contrast  to  the  myopic  algorithm,  the 
farsighted  algorithm  schedules  the  FTI  mode  less 
frequently.  Instead,  it  schedules  shorter  MTI  modes  to 
revisit  stopped  targets  at  an  appropriate  rate  to  determine 
if  they  have  started  moving.  The  resulting  revisit  rates  on 
the  moving  targets  are  faster,  and  the  farsighted  algorithm 
is  able  to  maintain  track  errors  on  the  moving  targets 
below  the  desired  goals  for  longer  periods  of  time. 

We  believe  that  our  simulation  results  are  important 
indicators  that,  for  some  sensor  management  problems, 
farsighted  strategies  are  better  than  myopic  ones.  We 
believe  that  this  is  the  case  for  sensor  management 
problems  with  complex  dynamics  (e.g.,  when  targets  are 
randomly  starting  and  stopping  and/or  sensor  actions 
have  significantly  different  durations).  The  move/stop 
tracking  problem  considered  here  is  one  such.  For  these 
problems,  the  consequences  of  a  single  sensor  action  do 
not  provide  enough  information  about  the  impact  of  the 
action  on  the  future  system  behavior.  Thus,  to  make  good 
decisions,  it  is  important  that  the  sensor  manager 
anticipates  future  consequences  resulting  from  the  sensor 
actions  taken  at  the  present  time.  In  particular,  the 
farsighted,  rollout  sensor  manager  results  in  a  strategy 
that  adaptively  adjusts  the  frequency  with  which  moving 
and  stopped  targets  are  observed  in  a  manner  that  results 
in  better  tracks  than  the  myopic,  entropy-optimizing 
algorithm. 

5  Conclusions 

We  have  developed  a  novel,  computable,  farsighted 
sensor  manager  for  move/stop  tracking  with  a  multi-mode 
sensor.  This  particular  sensor  management  problem  is 
challenging  because  of  the  complex  target  dynamics  and 
the  variable  duration  of  sensor  actions.  We  have 
evaluated  the  farsighted  algorithm  against  a  myopic, 
entropy-optimizing  sensor  management  algorithm.  Our 
simulation  results  indicate  that  the  farsighted  algorithm 
has  promising  behavior.  In  particular,  the  farsighted 
algorithm  results  in  better  quality  tracks  than  the  myopic 
algorithm. 

Stochastic Control Bounds on Sensor Network Performance


David  A.  Castanon 


Abstract —  Consider  a  network  of  sensors,  each  of  which  has 
limited  sensing  resources,  which  is  tasked  with  collecting  noisy 
classification  information  on  a  group  of  unknown  objects.  The 
amount of resources required for a given sensor to measure an
object depends on the specific sensor-object geometry. Sensors
exchange  collected  information  to  estimate  object  identities  and 
coordinate  which  measurements  to  collect  next.  This  paper 
describes  a  computable  lower  bound  on  the  classification  error 
that  can  be  achieved  by  a  causal  adaptive  sensing  schedule.  This 
bound  is  based  on  a  formulation  of  the  adaptive  sensing  problem 
as  a  partially  observed  stochastic  control  problem.  Expanding 
the  admissible  control  space  of  this  problem  leads  to  a  relaxed 
problem  with  simpler  decision  structure  for  which  the  bounds 
can  be  computed.  The  bound  computations  are  illustrated  for 
several  examples  involving  100  unknown  objects,  and  compared 
with  the  Monte  Carlo  performance  of  specific  adaptive  sensor 
scheduling  algorithms.  Comparisons  with  optimal  scheduling 
algorithms  for  special  cases  illustrate  the  tightness  of  the 
bounds. 

I.  INTRODUCTION 

There are many recent applications for networks of sensors,
each of which has a given amount of resources, such as
available power or duty cycle. Often, each sensor has multiple
sensing modes that it can use to collect different types
of information; the amount of resources required to collect
a measurement by a sensor depends on the specific sensor-object
geometry and the mode used. The network is tasked
with using its available resources to obtain information on
a given number of objects or areas. In order to achieve the
best information possible, it is important to coordinate the
allocation and scheduling of the different sensors and sensor
modes across objects of interest. Sensors exchange collected
information to determine the current state of information on
objects. The adaptive sensing problem consists of selecting
and scheduling the sensor modes which are applied to objects
of interest based on the collected past information.

This paper develops a model for a class of adaptive
sensing problems involving the objective of classifying a
known number of unknown objects at known locations, given
a fixed number of sensors with finite resources and finite
modes. We assume that sensor performance parameters are
time-invariant, so that the performance associated with a
sensor observing an object with a given mode does not depend
on the time at which the sensing activity occurs. This class of
problems arises in several applications, including object
classification using multiple airborne platforms, dynamic search,
and fault inspection and isolation in manufacturing systems.

This work was supported in part by grants NSF DMI-0330171 and
DARPA F33615-02-C-1197.

Dept. Electrical & Computer Eng., Boston University, Boston, MA,
dac@bu.edu


In  these  applications,  inaccuracies  in  sensor  measurements 
and  variations  in  object  characteristics  result  in  individual 
measurements  that  provide  noisy  estimates  of  object  type 
whose  quality  depends  on  the  specific  mode  used  by  the 
sensor.  In  situations  with  multiple  sensors,  multiple  objects 
and  limited  resources,  this  noisy  information  can  be  used  to 
prioritize  which  objects  to  look  at  next,  from  which  sensor, 
and  to  assign  appropriate  sensor  modes  to  the  objects. 

Because of the uncertain nature of the underlying object
types and the adaptive nature of the desired schedules, adaptive
sensing problems can be formulated as partially observed
Markov decision problems (POMDP) [1], [2], [10], [11]. As
such, this class of problems can be solved using stochastic
dynamic programming [3]. However, for large numbers of
objects, the required state space is very high-dimensional,
consisting of the conditional probability distributions of all of
the objects. This leads to intractable computational problems,
even with the fastest POMDP algorithms.

Sensing  problems  have  been  formulated  previously  as 
dynamic  optimization  problems  with  partial  information.  The 
extensive  literature  in  search  theory  [20]  deals  with  sensor 
management  problems  involving  objects  that  can  be  of  one 
of  two  types  (hidden  or  found)  with  sensors  that  have  only 
a  single  mode.  The  dynamic  hypothesis  testing  problems 
studied  in  [6]  also  have  objects  that  can  be  of  two  types  and 
a  single  sensor  mode,  but  generalize  results  in  search  theory 
to  broader  classes  of  measurements.  More  recently,  there  has 
been  work  [17]  using  Markov  decision  problem  techniques 
for  sensor  management,  particularly  techniques  based  on 
the  solution  of  multiarmed  bandit  problems.  However,  these 
formulations  also  restrict  the  sensors  to  a  single  sensor  with 
a  single  mode,  and  require  an  infinite  horizon,  time-invariant 
formulation. 

Because of the complexity of general adaptive sensing
algorithms with multiple sensors and modes, most practical
algorithms are heuristic algorithms based on information-theoretic
metrics [5]. To date, there has been no effective
approach that can characterize the achievable adaptive sensing
performance to determine whether such heuristic
algorithms are performing well.

In this paper, we consider sensing problems involving
multiple distributed sensors with multiple modes per sensor.
This model is an extension of the model discussed in [7]. We
show that the resulting POMDP models admit a lower bound
on classification error performance based on modifying the
constraint structure to expand the space of admissible strategies.
The resulting problem becomes a dynamic optimization
problem subject to expected value constraints, a class of
problems recently studied in [24]. We develop a hierarchical
algorithm that exploits the structure of the resulting
relaxed problem. This hierarchical algorithm is based on
the solution of single object POMDP problems, coupled
with nondifferentiable optimization techniques based on
Lagrangian relaxation [16]. The single object problems are of
small dimension, and can be readily solved using standard
algorithms for POMDPs [10], [11], [13]. The hierarchical
algorithm avoids the exponential growth of the dimensions
of the resulting state space in the POMDP problem as a
function of the number of objects.

The paper includes several examples where the lower
bound performance is computed, and compared with the
Monte Carlo performance achieved by suboptimal SM
algorithms. In particular, we compute bounds for a special
problem for which the optimal sensing strategy is known,
and compare the bounds to the optimal performance to show
how tight the bounds are.

II.  PROBLEM  FORMULATION 

In this section, we develop a formulation of the adaptive
sensing problem as a partially observed Markov decision
problem (POMDP). Assume that there are $N$ objects of
interest in the problem, with known locations. Each object
can belong to one and only one of $K$ different classes, and
the object identity does not change over time. Let the variable
$x_i \in X = \{1, \ldots, K\}$ denote the true class of object $i$. We
define the complete (but unknown) system state as:

$$\underline{x} = (x_1 \ x_2 \ \cdots \ x_N) \qquad (1)$$

Since the identities do not change over time, the complete
system state is constant over time. We assume that the $x_i$ are
independent random variables with values in the finite space
$X$. Associated with each object $i$ is a prior probability vector
$\pi_i(0)$ which describes the probability distribution of the
random variable $x_i$. That is,

$$\pi_{ij}(0) = \mathrm{Prob}\{x_i = j\} \qquad (2)$$

These probability distributions represent a priori knowledge
collected on each object.

To obtain information about the state of each object,
selected objects are examined with different modes from
different sensors. In order to simplify the notation in the
exposition, we consider the case of a single sensor with
multiple modes $m \in \{1, \ldots, M\}$. We will highlight later
the extensions required to incorporate multiple sensors. The
action to use a sensor mode $m$ on object $i$ produces an
observable $y_m$ in a finite set $Y_m$, with a conditional probability
distribution that depends only on the object $i$, its type $x_i$
and the mode $m$, denoted by $p(y_m \mid i, x_i, m)$. We assume
that the observation outcomes of these sensing actions are
conditionally independent of each other given the object
types.

We assume that obtaining a measurement of object $i$ with
mode $m$ requires sensor resources $R_{im} > 0$ (e.g., power),
which depend on the object location, sensor location and
specific mode selected. The sensor has a finite amount of
sensor resources $R$ that can be used for measuring objects. The
objective is to classify, with minimal error cost, the objects
after the sensor resource $R$ is exhausted. This formulation is
stated below.

Without loss of generality, we restrict our attention to
sensing policies that execute only one action at a time.
Such strategies are optimal in that they provide maximal
information for adaptation, and will achieve minimal error
cost. Let $u(k) = (i(k), m(k))$ denote the $(k+1)$-th action
(starting at $k = 0$) taken by the sensor, consisting of
measuring object $i(k)$ with mode $m(k)$. Let $U$ denote the
set of possible sensor actions, and let $y_{m(k)}(k)$ denote the
measured value resulting from action $u(k) \in U$. The past
information available to adaptively select $u(k)$ is $I(k) =
\{u(0), y_{m(0)}(0), \ldots, u(k-1), y_{m(k-1)}(k-1)\}$. The sensing
decisions are selected adaptively until a final random
stopping time $T$, selected based on the information
$I(T)$. At this stopping time, the information
$I(T)$ is available for estimating the object types. For each
object $i$, there is a final decision $v_i \in X$ based on $I(T)$ that
is selected to minimize the expected classification error.

An admissible adaptive sensing policy is a set of measurable
feedback policies $\{\gamma(0), \ldots, \gamma(T)\}$ and a stopping time
$T$ such that

$$\gamma(k) : I(k) \to U, \quad k < T$$
$$T : I(T) \to \{\text{stop}, \text{continue}\}$$
$$\gamma(T) : I(T) \to X^N \qquad (3)$$

Let $\Gamma$ denote the set of all admissible sensing policies. Since
the observation space is finite and the decision space is also
finite, $\Gamma$ is a countable space.

Denote by $c(v, x)$ the cost of selecting classification
decision $v$ when the true object type is $x$. The adaptive sensing
problem is to minimize the expected total classification cost

$$J(\gamma) = E_{\gamma}\Big\{\sum_{i=1}^{N} c(v_i, x_i)\Big\} \qquad (4)$$

over adaptive sensing policies $\gamma \in \Gamma$ satisfying the resource
utilization constraint

$$\sum_{k=0}^{T-1} R(u(k)) \le R \qquad (5)$$

with the notation $R(u(k)) = R_{i(k)m(k)}$. Note that the
constraint in (5) is a sample path constraint; for every realization
of the information sets $I(k)$, the adaptive policy $\gamma$ must not
exceed the total sensor resources available. Since the sets of
possible observation outcomes per mode $Y_m$ and possible
decisions $u_m$ are finite, the number of possible information
sets $I(k)$ after $k - 1$ actions is also finite. This implies
that there is a finite number of possible admissible sensing
policies that satisfy the constraint (5).

The above problem belongs to a class of finite-state,
finite-observation partially observed Markov decision problems
studied in [1], [2], [10], [11], with the special structure
that the underlying state dynamics are trivial, and with the
added presence of the sample path constraints of (5). Such problems
can be transformed into fully observed Markovian decision
problems in terms of a sufficient statistic: the conditional
probability distribution of the state $\underline{x}$ given information $I(k)$,
as follows. Let $S \subset \mathbb{R}^K$ denote the space of probability
distributions on $X$, and let $S^N$ denote the space of
probability distributions on $X^N$. The conditional distribution
vector for the composite state $\underline{x}$ given the information $I(k)$,
$P(\underline{x} \mid I(k)) \in S^N$, can be viewed as an information state,
a sufficient statistic summarizing the past observations. The
recursive evolution of this information state in response to
an action $u(k) = (i(k), m(k))$ can be described by Bayes'
rule as

$$P(\underline{x} \mid I(k+1)) = P(\underline{x} \mid I(k), u(k), y_{m(k)}(k)) = \frac{P(y_{m(k)}(k) \mid x_{i(k)}, m(k))\, P(\underline{x} \mid I(k))}{P(y_{m(k)}(k) \mid I(k), u(k))} \qquad (6)$$

with the initial condition

$$P(\underline{x} \mid I(0)) = \prod_{i=1}^{N} \pi_{i x_i}(0) \qquad (7)$$

Under the previous independence assumptions, the following
lemma establishes a convenient representation:

Lemma 2.1: Under the adaptive sensing problem assumptions,
the conditional probability $P(\underline{x} \mid I(k))$ can be factored
as

$$P(\underline{x} \mid I(k)) = \prod_{i=1}^{N} P(x_i \mid I(k)) \qquad (8)$$

where the evolution of $P(x_i \mid I(k))$ under sensing action
$u(k) = (i(k), m(k))$ and observation $y_{m(k)}(k)$ is given by

$$P(x_i \mid I(k+1)) = \begin{cases} P(x_i \mid I(k)) & \text{if } i(k) \ne i \\[4pt] \dfrac{P(y_{m(k)}(k) \mid x_i, m(k))\, P(x_i \mid I(k))}{\sum_{j=1}^{K} P(y_{m(k)}(k) \mid x_i = j, m(k))\, P(x_i = j \mid I(k))} & \text{otherwise} \end{cases} \qquad (9)$$

The proof of this lemma is straightforward by induction, as
the independence assumption of the object types $x_i$ guarantees
that the lemma is satisfied at $k = 0$, and (6) establishes
the recursion. Note also that $P(x_i \mid I(k))$ depends only on
measurements in $I(k)$ corresponding to object $i$.

Define $\pi_i(k)$ to be the conditional probability distribution
of $x_i$ given information $I(k)$:

$$\pi_i(k) = P(x_i \mid I(k)) \qquad (10)$$

The vector $\pi_i(k)$ has components $\pi_{ij}(k) = P(x_i = j \mid I(k))$.
Lemma 2.1 establishes that the conditional probability
distribution of the entire state, $P(\underline{x} \mid I(k))$, can be computed as
the product of $\pi_i(k)$, $i = 1, \ldots, N$. Define the information
vector $\underline{\pi} = (\pi_1 \ \cdots \ \pi_N)$. For a given observation $y_m$
using mode $m$ on object index $i$, define the observation
probability matrix as the $K \times K$ diagonal matrix

$$B_i(y_m) = \mathrm{diag}\{P(y_m \mid x_i = 1, m), \ldots, P(y_m \mid x_i = K, m)\}$$

The information vector evolves in response to a measurement
$y_m$ obtained from a sensing action $(i', m)$ according to an
operator $T$, where

$$T(\underline{\pi}, u = (i', m), y_m) = \begin{pmatrix} T_1(\pi_1, u = (i', m), y_m) \\ \vdots \\ T_N(\pi_N, u = (i', m), y_m) \end{pmatrix}$$

and

$$T_i(\pi_i, u = (i', m), y_m) = \begin{cases} \pi_i & \text{if } i' \ne i \\[4pt] \dfrac{B_i(y_m)\,\pi_i}{e^T B_i(y_m)\,\pi_i} & \text{if } i' = i \end{cases}$$

and $e$ is a $K$-dimensional vector of all ones.
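The per-object belief update can be sketched in a few lines: only the measured object's distribution changes, by an elementwise product with the observation likelihoods followed by normalization. This is a minimal illustration with hypothetical function names; the diagonal matrix product is written as an elementwise product, and beliefs are stored as plain Python lists.

```python
def update_object_belief(pi_i, likelihoods):
    """Bayes update of one object's class distribution.

    pi_i[j] is the current probability of class j; likelihoods[j] is the
    probability of the received observation given class j and the mode used.
    """
    unnormalized = [l * p for l, p in zip(likelihoods, pi_i)]  # B_i(y_m) * pi_i
    total = sum(unnormalized)                                  # e^T B_i(y_m) pi_i
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnormalized]


def update_information_vector(pi, i_obs, likelihoods):
    """Apply the update to the measured object only; other objects are unchanged."""
    new_pi = [list(p) for p in pi]
    new_pi[i_obs] = update_object_belief(pi[i_obs], likelihoods)
    return new_pi


if __name__ == "__main__":
    beliefs = [[0.5, 0.5], [0.3, 0.7]]
    print(update_information_vector(beliefs, 0, [0.8, 0.2]))
```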

The adaptive sensing problem described above can be
solved by stochastic dynamic programming [3]. The resource
constraint in (5) can be incorporated into the dynamics to
obtain a dynamic programming recursion, as follows [24].
Define a value function $V(\underline{\pi}, C)$ to be the optimal value
of (3)-(5) when the initial information is $\underline{\pi}$ and the available
sensor resource level is $R = C$. The value function $V$
is thus defined on $S^N \times \mathbb{R}_+$. The dynamic programming
problem is stated as a total cost problem with nonnegative
costs. Let $U(R) \subset U$ denote the
set of feasible sensor actions $(i, m)$ such that $R_{im} \le R$.
At each decision stage, there is a choice of stopping and
classifying the objects with the available information, or
taking additional measurements. The optimal value function
satisfies the Bellman equation

$$V(\underline{\pi}, R) = \min\Big[\sum_{i=1}^{N} \min_{v_i \in X} \sum_{j=1}^{K} c(v_i, j)\,\pi_{ij},\ \min_{u = (i', m') \in U(R)} E_{y_{m'}}\{V(T(\underline{\pi}, (i', m'), y_{m'}), R - R_{i'm'})\}\Big] \qquad (11)$$

where

$$E_{y_{m'}}\{V(T(\underline{\pi}, (i', m'), y_{m'}), R - R_{i'm'})\} = \sum_{y_{m'} \in Y_{m'}} P(y_{m'} \mid I(k), u)\, V(T(\underline{\pi}, u, y_{m'}), R - R_{i'm'}) = \sum_{y_{m'} \in Y_{m'}} e^T B_{i'}(y_{m'})\,\pi_{i'}\, V(T(\underline{\pi}, u, y_{m'}), R - R_{i'm'}) \qquad (12)$$

This recursion starts from the following boundary conditions:
Let $R_{\min} = \min_{i,m} R_{im}$. Then, the set of admissible actions
$U(R)$ is empty for $R < R_{\min}$. Thus,

$$V(\underline{\pi}, R) = \sum_{i=1}^{N} \min_{v_i \in X} \sum_{j=1}^{K} c(v_i, j)\,\pi_{ij} \quad \text{if } R < R_{\min} \qquad (13)$$

Eqs. (11)-(13) can be used recursively to compute the optimal
value for all information states and nonnegative resource levels.
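For very small instances, the recursion can be evaluated directly by memoized recursion over the information vector and the remaining resource level. The sketch below is an illustrative brute-force evaluation under assumed data structures (nested lists indexed by object, mode, outcome, and class); it is not the hierarchical algorithm developed in the paper, and its cost grows exponentially with the number of objects and the budget.

```python
from functools import lru_cache


def solve_adaptive_sensing(priors, likelihood, cost, resource_cost, budget):
    """Optimal expected classification cost for a tiny instance.

    priors[i][j]           : prior probability that object i is of class j
    likelihood[i][m][y][j] : P(outcome y | object i is class j, mode m)
    cost[v][j]             : cost of deciding class v when the true class is j
    resource_cost[i][m]    : resource R_im > 0 used by mode m on object i
    """
    n, k = len(priors), len(priors[0])

    def stop_cost(pi):
        # Stopping cost: each object is classified independently.
        return sum(
            min(sum(cost[v][j] * p[j] for j in range(k)) for v in range(k))
            for p in pi
        )

    @lru_cache(maxsize=None)
    def value(pi, r):
        best = stop_cost(pi)  # option 1: stop and classify now
        for i in range(n):
            for m, r_im in enumerate(resource_cost[i]):
                if r_im > r:
                    continue  # action (i, m) is not feasible at resource level r
                expected = 0.0
                for lik in likelihood[i][m]:
                    unnorm = [lik[j] * pi[i][j] for j in range(k)]
                    p_y = sum(unnorm)  # probability of this outcome
                    if p_y == 0.0:
                        continue
                    post = tuple(u / p_y for u in unnorm)
                    next_pi = pi[:i] + (post,) + pi[i + 1:]
                    expected += p_y * value(next_pi, r - r_im)
                best = min(best, expected)  # option 2: measure, then continue
        return best

    return value(tuple(tuple(p) for p in priors), budget)


if __name__ == "__main__":
    # One object, two classes, 0-1 cost, one symmetric mode of unit resource cost.
    lik = [[[[0.8, 0.2], [0.2, 0.8]]]]
    print(solve_adaptive_sensing([[0.5, 0.5]], lik, [[0, 1], [1, 0]], [[1]], 1))
```

The memoization key is the full information vector, which is exactly the source of the dimensionality problem discussed next.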

The initialization of the recursion decouples into $N$
independent optimizations, as there are no coupling constraints
on the decisions $v_i$, and the local decision costs $c(v_i, x_i)$
depend only on the marginal probability distributions of
each object's type. However, the recursion (11) does not
preserve this decomposability. The coupling arises primarily
because of the resource use constraints in (5); the decision
of which object to view and which mode to use depends on
the information vector of all the objects and the available
resources. Thus, the dynamic programming induction must
be carried out for the entire state $\underline{\pi}(k)$, which becomes a
formidable computation problem even for moderate numbers
of objects.


III.  RELAXED  FORMULATION  AND  LOWER 
BOUNDS 

To obtain a simpler dynamic programming formulation, we relax the sample path sensor resource use constraints (5) and use an averaged version of the same constraints:

$$E\Big\{\sum_{k=1}^{T} R(u(k))\Big\} \le R \qquad (14)$$

This approach replaces a large set of constraints (one per sample path) by a single aggregate constraint. Note that any adaptive sensing policy that satisfies (5) will also satisfy (14). Thus, this approach enlarges the set of admissible policies. Let $J^*$ denote the optimal classification cost of the problem in (3)-(4) with constraints (5). Let $J_A^*$ denote the optimal classification cost of the problem in (3)-(4) with constraint (14), referred to as the relaxed adaptive sensing problem.

Lemma 3.1: $J^* \ge J_A^*$

The relaxed problem has a single coupling constraint (for one sensor) relating the sensing actions on different objects. Let $\lambda \ge 0$ denote a Lagrange multiplier. For any admissible policy $\gamma \in \Gamma$, consider the objective

$$J(\lambda, \gamma) = E_\gamma\Big\{\sum_{i=1}^{N} c(v_i, x_i)\Big\} + \lambda\Big[E_\gamma\Big\{\sum_{k=0}^{T-1} R(u(k))\Big\} - R\Big] \qquad (15)$$

Consider the unconstrained adaptive sensing problem of finding policies $\gamma$ and an adaptive stopping time $T$ to minimize (15). If $(\gamma, T)$ is an adaptive SM policy that satisfies (14), the second term in (15) is nonpositive. Denote by $J^*(\lambda)$ the optimal value of (15) over all adaptive sensing policies $\gamma \in \Gamma$. Then,

Lemma 3.2: For all values of $\lambda \ge 0$,

$$J^* \ge J_A^* \ge J^*(\lambda) \qquad (16)$$

In particular,

$$J^* \ge \sup_{\lambda \ge 0} J^*(\lambda) \qquad (17)$$


Lemma 3.2 is a consequence of weak duality in nonlinear programming [4]. Note that the number of adaptive sensing policies that satisfy (15) is finite, because the set of possible histories $I(k)$ is finite for all $k$. Thus, computation of $J_A^*$ is an integer programming problem, and computation of $\sup_{\lambda \ge 0} J^*(\lambda)$ is its dual problem.
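The second inequality in Lemma 3.2 can be checked in two steps. The following is a sketch of the standard weak duality argument, written with $\lambda$ for the Lagrange multiplier, $\gamma$ for a policy and $\Gamma$ for the set of admissible adaptive sensing policies. The first step restricts the minimization to policies satisfying the averaged constraint (14); the second uses the fact that the multiplier term is nonpositive for such policies.

```latex
\begin{align*}
J^*(\lambda)
  &= \min_{\gamma \in \Gamma} J(\lambda, \gamma)
   \;\le\; \min_{\gamma \in \Gamma:\ \gamma \text{ satisfies (14)}} J(\lambda, \gamma) \\
  &\le \min_{\gamma \in \Gamma:\ \gamma \text{ satisfies (14)}}
       E_\gamma\Big\{ \sum_{i=1}^{N} c(v_i, x_i) \Big\}
   \;=\; J_A^* .
\end{align*}
```

Since this holds for every $\lambda \ge 0$, taking the supremum over $\lambda$ preserves the inequality.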

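The key issue is whether the lower bounds $J^*(\lambda)$ can be computed efficiently. Rewrite (15) for $\gamma \in \Gamma$ as

$$J(\lambda, \gamma) = E_\gamma\Big\{\sum_{i=1}^{N}\Big[c(v_i, x_i) + \lambda \sum_{k=0}^{T-1} R(u(k))\,\delta_{i(k)-i}\Big]\Big\} - \lambda R \qquad (18)$$

where the indicator function $\delta_n = 1$ if $n = 0$, and 0 otherwise. This suggests that optimization of $J(\lambda, \gamma)$ may be separable across individual objects $i$.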

Partition the information $I(k)$ into disjoint sets $I_i(k)$, where $I_i(k)$ contains the sensing actions and measurements applied to object $i$:

$$I_i(k) = \{(u(j), y(j)) \mid j < k,\ i(j) = i\} \qquad (19)$$

Note that the conditional probability vector $\pi_i$ only changes on measurements included in $I_i(k)$. We wish to restrict the set of adaptive sensing policies to a subset where the decision to apply a sensor action for object $i$ depends only on the information previously collected for object $i$. We refer to this subset of policies as adaptive local sensing policies, defined as:

Definition 3.1: An adaptive local sensing policy is an adaptive sensing policy $\gamma$ and stopping times $T_i$, $i = 1, \ldots, N$, with the properties that, for each sensing action instance $k$,

1) If $u(k) = (i(k), m(k))$, then $i(k) = k \bmod N + 1$.

2) The selected sensor mode $m(k)$ depends only on the information $I_{i(k)}(k)$.

3) For each object $i$, there is a stopping time $T_i$ which depends only on $I_i(T_i)$ such that, for all $k \ge T_i$, if $i = k \bmod N + 1$, no sensing action is taken. If $k < T_i$ and $i = k \bmod N + 1$, then $u(k) = (i, m)$ for some mode $m$ in $\{1, \ldots, M\}$.

4) At time $T_i$, the local decision $v_i$ for object $i$ is selected as a function of $I_i(T_i)$.

Adaptive local sensing policies use a fixed round-robin schedule for selecting which objects to measure. Furthermore, the choice of sensing mode for each action on object $i$ depends only on the prior information collected on that object. In addition, there is an independent stopping time for each object $i$ such that a final classification decision is made on object $i$, based only on prior information collected on that object. Note that there are decision instances $k$ where no sensing action is taken, when $k \ge T_i$ and $i = k \bmod N + 1$; these instances correspond to times after a final decision has been selected for object $i$. The effective stopping time of an adaptive local sensing policy is defined as $T = \max_{i=1,\ldots,N} T_i$, and is the earliest time at which every object has a final classification decision. Thus, adaptive local sensing policies can be viewed as a subset of the class of adaptive sensing policies.

Let $\Gamma_L$ denote the set of adaptive local sensing policies. For a given amount of sensor resources $R$, there are a finite number of feasible adaptive local sensing policies. In general, $\Gamma_L$ is a countable discrete set. For the purposes of bound computation, we will expand $\Gamma_L$ to include mixed policies, consisting of probabilistic mixtures of policies in $\Gamma_L$:

Definition 3.2: A mixed local sensing policy is a probability distribution $q(\gamma)$ over $\Gamma_L$ such that local SM policy $\gamma$ is selected for use with probability $q(\gamma)$. The set of mixed local sensing policies is denoted by $Q(\Gamma_L)$.

Consider the problem of minimizing the relaxed cost (18) over local sensing policies $\Gamma_L$. Since $\Gamma_L \subset \Gamma$, we have

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) \le \min_{\gamma \in \Gamma_L} J(\lambda, \gamma) \qquad (20)$$

Furthermore, since (18) is an unconstrained objective, the minimum over mixed local sensing policies is achieved by a pure local sensing policy, so

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) \le \min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\lambda, \gamma) \qquad (21)$$

The importance of mixed local sensing policies is highlighted in the theorem below, proven in [25]:

Theorem 3.1: For any admissible adaptive sensing policy $\gamma \in \Gamma$, there exists a mixed local sensing policy $q \in Q(\Gamma_L)$ such that the expected classification costs in (4) and the expected total resource use in (14) are equal under both policies $\gamma$ and $q$.

This result implies the following inequality:

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) \ge \min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\lambda, \gamma) \qquad (22)$$

Combining (21) and (22) yields the following:

$$\min_{\gamma \in \Gamma} J(\lambda, \gamma) = \min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, J(\lambda, \gamma) = \min_{\gamma \in \Gamma_L} J(\lambda, \gamma) \qquad (23)$$

Eq. (23) implies that lower bounds for the achievable classification performance can be computed by optimizing over local sensing policies only. For each local policy $\gamma \in \Gamma_L$, let $\gamma_i$ denote the policy that is used for instances $k$ when actions are taken for object $i$, and let $\Gamma_L^i$ be the set of such admissible local policies for object $i$. Thus, $\gamma_i$ selects actions for object $i$ based on past observations $I_i(k)$, and selects a stopping time $T_i$ and a final classification $v_i$ at that stopping time. The importance of local sensing policies is that the optimization in (23) decouples over objects as

$$\min_{\gamma \in \Gamma_L} J(\lambda, \gamma) = \sum_{i=1}^{N} \min_{\gamma_i \in \Gamma_L^i} J_i(\lambda, \gamma_i) - \lambda R \qquad (24)$$

where

$$J_i(\lambda, \gamma_i) = E_{\gamma_i}\Big\{c(v_i, x_i) + \lambda \sum_{k=0}^{T-1} R(u(k))\,\delta_{i(k)-i}\Big\} \qquad (25)$$

This implies that computation of the bounds can be achieved with $N$ independent optimization problems for each value of $\lambda$. Furthermore, the optimal bound can be computed as in Lemma 3.2, as

$$J^* \ge \sup_{\lambda \ge 0} \min_{\gamma \in \Gamma_L} J(\lambda, \gamma) \qquad (26)$$

Note that the right hand side of (26) is the dual of the following linear programming problem:

$$\min_{q \in Q(\Gamma_L)} \sum_{\gamma \in \Gamma_L} q(\gamma)\, E_\gamma\Big\{\sum_{i=1}^{N} c(v_i, x_i)\Big\} \qquad (27)$$

subject to

$$\sum_{\gamma \in \Gamma_L} q(\gamma)\, E_\gamma\Big\{\sum_{k=0}^{T-1} R(u(k))\Big\} \le R \qquad (28)$$

$$\sum_{\gamma \in \Gamma_L} q(\gamma) = 1 \qquad (29)$$

which is a linear program over the choice of probability distributions $q \in Q(\Gamma_L)$. This can be exploited to solve efficiently for the bound. Specifically, note that this is a linear program subject to two constraints, which implies that an optimal mixed local sensing policy $q$ can be chosen with support on at most two pure local sensing policies. This property will be exploited in the next section for bound computation.

IV.  BOUND  ALGORITHMS 

There are two potential approaches to compute a lower bound: a dual approach, based on Lagrangian relaxation [16], that optimizes (26) over the choice of dual variable $\lambda$, and a primal approach based on solving the linear program (27)-(29). The dual approach is straightforward, and uses techniques from nondifferentiable optimization [19] to search the space of possible $\lambda$. The primal approach is harder, because the optimization is over a large space of possible values of mixture probabilities $q$. However, this mixture has very sparse support, which makes it suitable for column generation algorithms [18].

A fundamental step in either approach is the computation of the optimal local sensing policies for a fixed value of $\lambda$ for each object. For object $i$, one must solve the local problem given $\lambda$:

$$\min_{\gamma_i \in \Gamma_L^i} E_{\gamma_i}\Big[c(v_i, x_i) + \lambda \sum_{k \ge 0:\ i = k \bmod N + 1}^{T_i - 1} R(u(k))\Big] \qquad (30)$$

This problem is a multi-stage single object POMDP, with sufficient statistic given by the marginal probability distribution $\pi_i(k)$. One can reduce the action instants to a new counter $k'$ indexing only the action opportunities for object $i$, to obtain

$$\min_{\gamma_i \in \Gamma_L^i} E_{\gamma_i}\Big[c(v_i, x_i) + \lambda \sum_{k'=0}^{T_i - 1} R_{i\,m(k')}\Big] \qquad (31)$$

The resulting POMDP problems are small enough to solve using existing algorithms such as those overviewed in [2], [10], [13], [14].
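To make the dual approach concrete, the sketch below solves the single-object problem for a fixed multiplier by exhaustive dynamic programming over the reachable beliefs, then maximizes the resulting dual value over a grid of multipliers. It is a simplified illustration rather than one of the POMDP algorithms cited above. It assumes identical objects, a hypothetical cap `horizon` on the number of looks per object (truncating the policy set this way can only raise the computed value, so the cap should be generous), and a plain grid search in place of nondifferentiable optimization. The function names are hypothetical.

```python
from functools import lru_cache

def local_value(prior, lik, mode_cost, cost, lam, horizon):
    """Minimum over local policies of E[c(v, x)] + lam * E[resource use], one object.

    lik[m][x][y]: P(y | type x, mode m); mode_cost[m]: resource use of mode m;
    cost[v][x]: classification cost; horizon: assumed cap on looks per object.
    """
    K = len(prior)

    @lru_cache(maxsize=None)
    def V(belief, looks_left):
        # Cost of stopping now and making the best local decision.
        stop = min(sum(c_v[x] * belief[x] for x in range(K)) for c_v in cost)
        if looks_left == 0:
            return stop
        best = stop
        for m, r in enumerate(mode_cost):
            total = lam * r                      # priced resource use of this look
            for y in range(len(lik[m][0])):
                py = sum(lik[m][x][y] * belief[x] for x in range(K))
                if py > 0.0:
                    post = tuple(round(lik[m][x][y] * belief[x] / py, 12)
                                 for x in range(K))
                    total += py * V(post, looks_left - 1)
            best = min(best, total)
        return best

    return V(tuple(prior), horizon)

def dual_bound(n_objects, budget, lambdas, prior, lik, mode_cost, cost, horizon):
    """Largest dual value over the candidate multipliers, for identical objects."""
    return max(n_objects * local_value(prior, lik, mode_cost, cost, lam, horizon)
               - lam * budget
               for lam in lambdas)

# Two equally likely types, one mode with symmetric error 0.25, 0-1 error cost.
lik = [[[0.75, 0.25], [0.25, 0.75]]]
cost = [[0, 1], [1, 0]]
grid = [i / 100 for i in range(101)]
bound = dual_bound(100, 100, grid, (0.5, 0.5), lik, [1], cost, horizon=6)
print(bound)
```

With 100 objects and a budget of 100 unit-cost measurements, the computed bound is 25 expected errors (up to floating-point rounding), which agrees with the corresponding entry of Table I for Pe = 0.25.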

Solution of the $N$ decoupled problems (31) yields a local policy $\gamma \in \Gamma_L$ for which the expected classification cost $E_\gamma[\sum_{i=1}^{N} c(v_i, x_i)]$ and expected resource use $E_\gamma[\sum_{k=0}^{T-1} R(u(k))]$ are computed from the solution. This provides the starting point for the use of column generation [18] for solution of (27)-(29). Column generation was used by Yost [21], [22], [23] in his work on POMDPs for resource assignment and was also exploited in [8] for the solution of stochastic weapon assignment problems.

The algorithm starts with an initial set of pure local sensing policies $\gamma^d$ indexed by $d = 1, \ldots, D$, with known expected classification performance $J^d$ and expected resource use $R^d$. The first step in the algorithm is to solve the linear program in (27)-(29) restricted to mixtures of the $d = 1, \ldots, D$ initial policies. Since the support of the admissible mixed policies is restricted, the solution provides an upper bound $J^{UB}$ to the optimal cost. Denote by $\lambda^D$ the optimal dual price of the resource constraint (28) in this solution. The column generation algorithm uses this optimal dual price value in (31) to generate a new candidate local policy $\gamma^{D+1}$ by solving $N$ independent POMDP problems with this value of $\lambda$. The combined solution of the $N$ subproblems also provides a lower bound $J^{LB}$ on the optimal performance, as described in Lemma 3.2. The key result in the column generation algorithm is stated as follows [18]:

Lemma 4.1: Consider the pure local policy generated by the solution of (31). If $J^{LB} = J^{UB}$, the optimal solution over all mixtures of local policies is a mixture of the local policies indexed by $d = 1, \ldots, D$. Otherwise, the pure local policy $\gamma^{D+1}$ can be used as part of a mixed policy which provides a cost lower than $J^{UB}$.
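The iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. Each pure local policy is summarized by a hypothetical pair (J, R) of expected cost and expected resource use. The restricted linear program is solved by enumerating single columns and pairs, which relies on an optimal mixture using at most two columns. The pricing step is replaced by a lookup in a small made-up pool `POOL` standing in for the N POMDP solves. The budget is assumed not to coincide exactly with any column's resource use.

```python
def solve_master(cols, budget):
    """Restricted mixture LP over columns (J_d, R_d): returns (J_UB, lam, weights)."""
    best = None
    for a, (Ja, Ra) in enumerate(cols):
        if Ra <= budget and (best is None or Ja < best[0]):
            best = (Ja, 0.0, {a: 1.0})            # pure policy, zero dual price
        for b, (Jb, Rb) in enumerate(cols):
            if Ra < budget < Rb:                  # mixture that spends the budget
                w = (Rb - budget) / (Rb - Ra)     # weight on the cheaper column a
                value = w * Ja + (1.0 - w) * Jb
                lam = (Ja - Jb) / (Rb - Ra)       # dual price of the resource row
                if best is None or value < best[0]:
                    best = (value, lam, {a: w, b: 1.0 - w})
    if best is None:
        raise ValueError("no feasible mixture for this budget")
    return best

def column_generation(initial_cols, budget, oracle, tol=1e-9, max_iter=100):
    """Alternate between the restricted LP and the pricing oracle."""
    cols = list(initial_cols)
    for _ in range(max_iter):
        j_ub, lam, weights = solve_master(cols, budget)
        new = oracle(lam)                         # minimizes J + lam * R
        j_lb = new[0] + lam * new[1] - lam * budget
        if j_ub - j_lb <= tol or new in cols:
            break
        cols.append(new)
    return j_lb, j_ub, {cols[d]: w for d, w in weights.items()}

# Hypothetical pool of pure policies, standing in for the POMDP pricing step.
POOL = [(0.50, 0.0), (0.25, 1.0), (0.15625, 3.0), (0.10, 5.0)]

def pool_oracle(lam):
    return min(POOL, key=lambda c: c[0] + lam * c[1])

j_lb, j_ub, mix = column_generation([(0.50, 0.0)], 2.0, pool_oracle)
print(j_lb, j_ub, mix)
```

On this made-up pool with a budget of 2, the bounds meet at 0.203125 after four master solves, with equal weight on the columns (0.25, 1.0) and (0.15625, 3.0). The initial column must be feasible for the budget, otherwise the first restricted problem has no solution.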

V.  EXTENSION  TO  MULTIPLE  SENSORS 

The development of the previous sections carries through with little modification when multiple sensors are used. The key difference is that there is a separate resource constraint for each sensor. Thus, there will be a vector of sensor resources $R_s$, where $s$ is a sensor index, resulting in a vector of averaged constraints (14). The Lagrange multipliers $\lambda$ will be vectors instead of scalars. Nevertheless, all of the lemmas and theorems can be extended to the multisensor case with minor modifications.

The main assumption that was used in the single sensor formulation was that only one sensor action would be performed at a time. This assumption is still used for the multiple sensor problem to derive the lower bound, although the results in the previous section indicate that optimal local sensing strategies that achieve the lower bound may use simultaneous sensing by multiple sensors.

The  column  generation  algorithm  discussed  in  the  previous 
section  extends  naturally  to  multiple  sensors.  When  there 
are  L  sensors,  the  optimal  mixed  local  SM  policies  will  be 
mixtures  of  L  +  1  pure  local  SM  policies.  Nondifferentiable 
optimization  algorithms  that  maximize  the  dual  cost  can  also 
be  used  in  this  case. 

VI.  EXAMPLES 

For our first example, we consider a case where the optimal strategies are known [26]. In this example, there are 100 unknown objects with one of two types, with equal priors for each object. There is a single sensor that has a single measurement mode, and the problem is optimal adaptive allocation of a fixed number of measurements across the objects. Measurement outcomes are binary-valued, identifying one of the two types, and a single measurement has a probability of error Pe, which is symmetric over type. The objective is to minimize the expected number of classification errors after N measurements. The optimal strategy derived in [26] is to assign the next measurement to the object with conditional probability.

Table I shows the results of 1000 Monte Carlo simulations of the optimal strategy, compared with the predicted performance of the lower bound, in terms of expected number of classification errors for three different values of the symmetric single-measurement error probability Pe and four levels of the number of measurements N. As the table indicates, the bound predictions are very tight for this case. The gap between the bound and the optimal strategy increases as the number of measurements N increases because the likelihood of errors decreases, and the bound strategy allows the use of more resources than available in the unlikely cases that lead to errors.


             Pe = 0.25           Pe = 0.2            Pe = 0.15
   N      Bound     Opt.      Bound     Opt.      Bound     Opt.
  100    25        25.03     20        20.02     15        15.067
  200    18.182    18.185    12.727    12.765     7.888     7.988
  300    11.364    11.432     5.749     6.038     2.518     2.593
  400     7.833     7.905     3.468     3.543     0.927     0.987

TABLE I
Comparison of expected number of errors by Lower Bound and Monte Carlo of optimal strategy

For the second set of experiments, we consider a different 100 object scenario where objects can be of three different types (K = 3): cars, trucks and military vehicles (MV). There is a single sensor, with two modes: a low resolution mode 1 that takes 1 second per image ($R_{i1} = 1$), and a high resolution mode 2 that requires 5 seconds per image ($R_{i2} = 5$). Low resolution imagery is useful in separating cars from trucks and MVs, but separating trucks from MVs requires high resolution imagery. A priori, each object has a probability of 0.1 of being a military vehicle, 0.2 of being a truck and 0.7 of being a car. Imagery generated by the sensor is processed into a binary decision as to whether the object is MV or not. Hence $y_{ij} \in \{0, 1\}$, where 1 indicates that the decision is MV.

The objective of the problem is to determine as accurately as possible which objects are military vehicles (type 1). Thus, the classification costs are given by $c(v_i, x_i)$ as a 3 x 3 matrix where $v_i$ is the row index:

$$[c(v_i, x_i)] = \begin{pmatrix} 0 & MD & MD \\ FA & 0 & 0 \\ FA & 0 & 0 \end{pmatrix} \qquad (32)$$

where FA = 1 and MD will vary from 1 to 80 in the experiments.


            low-resolution       high-resolution
  Type      y = 0     y = 1      y = 0     y = 1
  Car       0.9       0.1        0.95      0.05
  Truck     0.1       0.9        0.85      0.15
  MV        0.1       0.9        0.8       0.2

TABLE II
Measurement likelihoods for different modes

The conditional probability distributions $p(y \mid x, m)$ are given in Table II. In terms of constraints, we assume that there is a single resource pool of $R$ seconds to be used before all objects need to be classified. This number will also be varied across the experiments from 300 seconds to 700 seconds, to evaluate the bounds and algorithm performance for scenarios where the amount of sensor resources ranges from poor to rich.

In order to evaluate the utility of the lower bound, we compare the bound with the performance of two adaptive SM algorithms: a variation of Kastella's discrimination gain (DG) algorithm [5], which is a sequential algorithm for selecting the best sensor mode and target on the basis of maximizing the expected entropy reduction in the distribution of object type per unit sensor resource applied, and a dynamic scheduling algorithm (ADP) based on Lagrangian relaxation and POMDP approximations described in [7].
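The selection rule behind the discrimination gain criterion can be sketched as follows. This illustrates only the criterion as described above (expected entropy reduction per unit of sensor resource); it is not the specific variation of the algorithm in [5] used in the experiments. The function names are hypothetical, and the likelihoods are the values of Table II as transcribed, with types ordered (car, truck, MV).

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)

def gain_per_resource(belief, lik_m, r):
    """Expected entropy reduction from one look with a mode, divided by its cost r.

    lik_m[x][y] is P(y | type x) for the mode under consideration.
    """
    expected_posterior_entropy = 0.0
    for y in range(len(lik_m[0])):
        py = sum(lik_m[x][y] * belief[x] for x in range(len(belief)))
        if py > 0.0:
            post = [lik_m[x][y] * belief[x] / py for x in range(len(belief))]
            expected_posterior_entropy += py * entropy(post)
    return (entropy(belief) - expected_posterior_entropy) / r

def select_action(beliefs, lik, mode_cost, remaining):
    """Return (gain, object, mode) with the largest gain per unit resource.

    Only modes affordable with the remaining resources are considered;
    returns None when no mode is affordable.
    """
    best = None
    for i, b in enumerate(beliefs):
        for m, r in enumerate(mode_cost):
            if r <= remaining:
                g = gain_per_resource(b, lik[m], r)
                if best is None or g > best[0]:
                    best = (g, i, m)
    return best

# Likelihoods as transcribed from Table II; rows are (car, truck, MV).
lik = [
    [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]],        # mode 0: low resolution, 1 second
    [[0.95, 0.05], [0.85, 0.15], [0.8, 0.2]],    # mode 1: high resolution, 5 seconds
]
prior = (0.7, 0.2, 0.1)
choice = select_action([prior, prior], lik, [1, 5], remaining=10)
print(choice)
```

The rule is myopic and ignores the classification cost matrix: it treats all reductions in uncertainty as equally valuable, regardless of which errors they help avoid.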

Each algorithm was simulated for 100 independent Monte Carlo runs using the same measurement outcomes to evaluate its average performance for three different levels of sensor resources: 300 seconds, 500 seconds and 700 seconds. The expected cost results were compared with the predictions of the lower bound. Table III includes the results for 300 seconds and 700 seconds of resources for six levels of missed detection (MD) costs. In this more complex case, the bound shows that there is room for improvement in both of the algorithms, although the performance of the algorithms is close to optimal for some of the conditions. For instance, when MD is close to 1, the costs of missed detections and false alarms are close, and policies such as maximizing information gain as measured by entropy are near-optimal. Similarly, the performance of the ADP is closer to the lower bound for limited sensor resources, as the limited lookahead approximation is closer to the actual optimal number of sensor actions per object.


              700 Seconds                   300 Seconds
  MD      Bound    ADP      DG          Bound    ADP      Greedy
   1       1.6      1.58     1.91        4.61     9.7      9.17
   5       4.5      4.46     6.75       15.66    17.03    18.62
  10       6.5      6.49     9.87       19.56    21.18    20.71
  20       8        8.25    14.87       21.67    22.38    22.11
  40      10       10.01    23.05       24.18    24.53    24.91
  80      11.25    14.6     29.85       26.38    26.38    30.5

TABLE III
Performance of Scheduling Algorithms vs. Bound


Figure 1 shows the results for the two algorithms and the lower bound for 500 seconds of sensing resource time. The results show that there is significant room for improvement in both policies: the discrimination gain (DG) algorithm fails to incorporate the relative values of different types of errors in its information seeking policy, and the ADP is conservative in that it does not use mixed policies and uses a limited lookahead, and thus can underutilize sensor resources.

Fig. 1. Monte Carlo performance of algorithms and lower bound for 500 seconds of sensor resource.

VII. DISCUSSION

In this paper, we have presented a lower bound for the achievable classification performance for a network of sensors with finite sensing resources. The approach is based on formulation of the adaptive sensing problem as a partially observed Markovian decision problem, which is then approximated by expanding the admissible decision space. This approximate formulation can be posed as an integer programming problem that has a separable dual formulation. A key result in establishing this separability is to show that the lower bound formulation can be solved in terms of a subset of sensing policies known as mixed local sensing policies, which are random mixtures of policies that select actions on each object based only on the past information collected on that object.

We presented experimental results that compared the lower bound with the performance of two suboptimal adaptive sensing algorithms available in the literature. The experimental results indicated that there is substantial room for improvement in both algorithms relative to the lower bound under some conditions, and that the bound is tight in that the performance of the suboptimal algorithms is close to the predicted performance of the bound for several conditions.

In terms of sensor networks, the bound in this paper neglects the cost of communications as compared to the cost of active sensing. This assumption is reasonable when sensors are in near vicinity of each other, and sensing requires active emissions by the sensors, so that the two-directional path loss is significant. In situations where communications also consume a significant amount of resources, the bound is optimistic, and would not be a good prediction for sensor network performance.